PSP as an admission controller resource is being deprecated. Deployed PodSecurityPolicies will keep working until version 1.25, their targeted removal from the codebase. A new feature, with a working title of "PSP replacement policy", is being developed in KEP-2579. To learn more, read PodSecurityPolicy Deprecation: Past, Present, and Future.
The Kustomize version in kubectl jumped from v2.0.3 to v4.0.5. Kustomize is now treated as a library, and future updates will be less sporadic.
Default Container Labels
Pods with multiple containers can use the kubectl.kubernetes.io/default-container annotation to have a container preselected for kubectl commands. Read more in KEP-2227.
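As an illustrative sketch (container names and images are hypothetical), a multi-container Pod using this annotation might look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: two-containers
  annotations:
    # kubectl commands default to the "app" container unless -c is given
    kubectl.kubernetes.io/default-container: app
spec:
  containers:
  - name: app
    image: nginx
  - name: sidecar
    image: busybox
    command: ["sleep", "infinity"]
```

With this in place, commands such as kubectl logs two-containers or kubectl exec target the app container unless a different one is named with -c.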
Immutable Secrets and ConfigMaps
Immutable Secrets and ConfigMaps graduate to GA. This feature allows users to specify that the contents of a particular Secret or ConfigMap are immutable for the object's lifetime. For such instances, the kubelet will not watch or poll for changes, thereby reducing apiserver load.
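A minimal example of marking a Secret immutable (the name and data here are illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-credentials
type: Opaque
stringData:
  password: s3cr3t
# Once set, the data and binaryData fields can no longer be updated;
# the Secret must be deleted and recreated to rotate its contents.
immutable: true
```

The same top-level immutable field works on ConfigMap objects.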
Structured Logging in Kubelet
The kubelet has adopted structured logging, thanks to community effort in accomplishing this within the release timeline. Structured logging in the project remains an ongoing effort; folks interested in participating can follow, and chime in on, the mailing list discussion.
Storage Capacity Tracking
Traditionally, the Kubernetes scheduler was based on the assumptions that additional persistent storage is available everywhere in the cluster and has infinite capacity. Topology constraints addressed the first point, but up to now pod scheduling was still done without considering that the remaining storage capacity may not be enough to start a new pod. Storage capacity tracking addresses that by adding an API for a CSI driver to report storage capacity and uses that information in the Kubernetes scheduler when choosing a node for a pod. This feature serves as a stepping stone for supporting dynamic provisioning for local volumes and other volume types that are more capacity constrained.
Generic Ephemeral Volumes
The generic ephemeral volumes feature allows any existing storage driver that supports dynamic provisioning to be used as an ephemeral volume, with the volume's lifecycle bound to the Pod. It can be used to provide scratch storage that is different from the root disk, for example persistent memory, or a separate local disk on that node. All StorageClass parameters for volume provisioning are supported. All features supported with PersistentVolumeClaims are supported, such as storage capacity tracking, snapshots and restore, and volume resizing.
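As a sketch, a Pod requesting scratch space via a generic ephemeral volume might look like the following (the storage class name is a hypothetical placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    # The PVC is created with the Pod and deleted when the Pod goes away.
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: scratch-storage-class  # hypothetical class name
          resources:
            requests:
              storage: 1Gi
```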
CSI Service Account Token
CSI Service Account Token feature moves to Beta in 1.21. This feature improves the security posture and allows CSI drivers to receive pods' bound service account tokens. This feature also provides a knob to re-publish volumes so that short-lived volumes can be refreshed.
CSI Health Monitoring
The CSI health monitoring feature is being released as a second Alpha in Kubernetes 1.21. This feature enables CSI Drivers to share abnormal volume conditions from the underlying storage systems with Kubernetes so that they can be reported as events on PVCs or Pods. This feature serves as a stepping stone towards programmatic detection and resolution of individual volume health issues by Kubernetes.
Known Issues
TopologyAwareHints feature falls back to default behavior
The feature gate currently falls back to the default behavior in most cases. Enabling the feature gate will add hints to EndpointSlices, but functional differences are only observed in non-dual-stack kube-proxy implementations. The fix will be available in coming releases.
Urgent Upgrade Notes
(No, really, you MUST read this before you upgrade)
Kube-proxy's IPVS proxy mode no longer sets the net.ipv4.conf.all.route_localnet sysctl parameter. Nodes upgrading will have net.ipv4.conf.all.route_localnet set to 1, but new nodes will inherit the system default (usually 0). If you relied on any behavior requiring net.ipv4.conf.all.route_localnet, you must ensure it is enabled, as kube-proxy will no longer set it automatically. This change helps to further mitigate CVE-2020-8558. (#92938, @lbernail) [SIG Network and Release]
Kubeadm: during "init" an empty cgroupDriver value in the KubeletConfiguration is now always set to "systemd" unless the user is explicit about it. This requires existing machine setups to configure the container runtime to use the "systemd" driver. Documentation on this topic can be found here: https://kubernetes.io/docs/setup/production-environment/container-runtimes/. When upgrading existing clusters / nodes using "kubeadm upgrade" the old cgroupDriver value is preserved, but in 1.22 this change will also apply to "upgrade". For more information on migrating to the "systemd" driver or remaining on the "cgroupfs" driver see: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/. (#99471, @neolit123) [SIG Cluster Lifecycle]
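For reference, the KubeletConfiguration fragment that kubeadm now defaults for new clusters is equivalent to the following (the container runtime must be configured to use the same driver; see the linked container runtime documentation):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# kubeadm "init" now defaults this to "systemd" when the value is empty;
# set it explicitly to "cgroupfs" to keep the old behavior.
cgroupDriver: systemd
```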
Newly provisioned PVs by EBS plugin will no longer use the deprecated "failure-domain.beta.kubernetes.io/zone" and "failure-domain.beta.kubernetes.io/region" labels. It will use "topology.kubernetes.io/zone" and "topology.kubernetes.io/region" labels instead. (#99130, @ayberk) [SIG Cloud Provider, Storage and Testing]
Newly provisioned PVs by OpenStack Cinder plugin will no longer use the deprecated "failure-domain.beta.kubernetes.io/zone" and "failure-domain.beta.kubernetes.io/region" labels. It will use "topology.kubernetes.io/zone" and "topology.kubernetes.io/region" labels instead. (#99719, @jsafrane) [SIG Cloud Provider and Storage]
Newly provisioned PVs by gce-pd will no longer have the beta FailureDomain label. gce-pd volume plugin will start to have GA topology label instead. (#98700, @Jiawei0227) [SIG Cloud Provider, Storage and Testing]
OpenStack Cinder CSI migration is on by default; the Cinder CSI driver must be installed on OpenStack clusters for Cinder volumes to work. (#98538, @dims) [SIG Storage]
Remove alpha CSIMigrationXXComplete flag and add alpha InTreePluginXXUnregister flag. Deprecate CSIMigrationvSphereComplete flag and it will be removed in v1.22. (#98243, @Jiawei0227)
Remove the storage metric storage_operation_errors_total, since storage_operation_status_count already exists. Also add a new status field to storage_operation_duration_seconds, so that latency can be observed for storage operations of every status. (#98332, @JornShen) [SIG Instrumentation and Storage]
The metric storage_operation_errors_total is not removed, but is marked deprecated, and the metric storage_operation_status_count is marked deprecated. In both cases the storage_operation_duration_seconds metric can be used to recover equivalent counts (using status=fail-unknown in the case of storage_operations_errors_total). (#99045, @mattcary)
The ServiceNodeExclusion, NodeDisruptionExclusion and LegacyNodeRoleBehavior features have been promoted to GA. ServiceNodeExclusion and NodeDisruptionExclusion are now unconditionally enabled, while LegacyNodeRoleBehavior is unconditionally disabled. To prevent control plane nodes from being added to load balancers automatically, users upgrading need to add the "node.kubernetes.io/exclude-from-external-load-balancers" label to control plane nodes. (#97543, @pacoxu)
Changes by Kind
Deprecation
Aborting the drain command partway through a list of nodes will be deprecated. The new behavior makes the drain command continue through all nodes even if the drain fails on one or more of them. For now, users can opt in to this behavior by enabling the --ignore-errors flag. (#98203, @yuzhiquan)
Delete the deprecated service.beta.kubernetes.io/azure-load-balancer-mixed-protocols mixed protocol annotation in favor of the MixedProtocolLBService feature (#97096, @nilo19) [SIG Cloud Provider]
Deprecate the topologyKeys field in Service. This capability will be replaced with upcoming work around Topology Aware Subsetting and Service Internal Traffic Policy. (#96736, @andrewsykim) [SIG Apps]
Kube-proxy: remove the deprecated --cleanup-ipvs flag of kube-proxy, and make the --cleanup flag always flush IPVS (#97336, @maaoBit) [SIG Network]
Kubeadm: deprecated command "alpha selfhosting pivot" is now removed. (#97627, @knight42)
Kubeadm: graduate the command kubeadm alpha kubeconfig user to kubeadm kubeconfig user. The kubeadm alpha kubeconfig user command is deprecated now. (#97583, @knight42) [SIG Cluster Lifecycle]
Kubeadm: the "kubeadm alpha certs" command is removed now, please use "kubeadm certs" instead. (#97706, @knight42) [SIG Cluster Lifecycle]
Kubeadm: the deprecated kube-dns is no longer supported as an option. If "ClusterConfiguration.dns.type" is set to "kube-dns", kubeadm will now throw an error. (#99646, @rajansandeep) [SIG Cluster Lifecycle]
Kubectl: The deprecated kubectl alpha debug command is removed. Use kubectl debug instead. (#98111, @pandaamanda) [SIG CLI]
Official support to build kubernetes with docker-machine / remote docker is removed. This change does not affect building kubernetes with docker locally. (#97935, @adeniyistephen) [SIG Release and Testing]
Remove deprecated --generator, --replicas, --service-generator, --service-overrides, --schedule from kubectl run
Deprecate --serviceaccount, --hostport, --requests, --limits in kubectl run (#99732, @soltysh)
Remove the deprecated metrics "scheduling_algorithm_preemption_evaluation_seconds" and "binding_duration_seconds", suggest to use "scheduler_framework_extension_point_duration_seconds" instead. (#96447, @chendave) [SIG Cluster Lifecycle, Instrumentation, Scheduling and Testing]
Removing experimental windows container hyper-v support with Docker (#97141, @wawa0210) [SIG Node and Windows]
Rename metrics etcd_object_counts to apiserver_storage_object_counts and mark it as stable. The original etcd_object_counts metrics name is marked as "Deprecated" and will be removed in the future. (#99785, @erain) [SIG API Machinery, Instrumentation and Testing]
The GA TokenRequest and TokenRequestProjection feature gates have been removed and are unconditionally enabled. Remove explicit use of those feature gates in CLI invocations. (#97148, @wawa0210) [SIG Node]
The PodSecurityPolicy API is deprecated in 1.21, and will no longer be served starting in 1.25. (#97171, @deads2k) [SIG Auth and CLI]
The batch/v2alpha1 CronJob type definitions and clients are deprecated and removed. (#96987, @soltysh) [SIG API Machinery, Apps, CLI and Testing]
The export query parameter (inconsistently supported by API resources and deprecated in v1.14) is fully removed. Requests setting this query parameter will now receive a 400 status response. (#98312, @deads2k) [SIG API Machinery, Auth and Testing]
audit.k8s.io/v1beta1 and audit.k8s.io/v1alpha1 audit policy configuration and audit events are deprecated in favor of audit.k8s.io/v1, available since v1.13. kube-apiserver invocations that specify alpha or beta policy configurations with --audit-policy-file, or explicitly request alpha or beta audit events with --audit-log-version / --audit-webhook-version must update to use audit.k8s.io/v1 and accept audit.k8s.io/v1 events prior to v1.24. (#98858, @carlory) [SIG Auth]
discovery.k8s.io/v1beta1 EndpointSlices are deprecated in favor of discovery.k8s.io/v1, and will no longer be served in Kubernetes v1.25. (#100472, @liggitt)
The diskformat StorageClass parameter for the in-tree vSphere volume plugin is deprecated as of the v1.21 release. Please consider updating your StorageClasses and removing the diskformat parameter; the vSphere CSI Driver does not support the diskformat StorageClass parameter.
vSphere releases earlier than 67u3 are deprecated as of v1.21. Please consider upgrading vSphere to 67u3 or above; the vSphere CSI Driver requires vSphere 67u3 at a minimum.
VM hardware versions earlier than 15 are deprecated as of v1.21. Please consider upgrading Node VM hardware to version 15 or above; the vSphere CSI Driver recommends that Node VM hardware be set to at least vmx-15.
Multi-vCenter support is deprecated as of v1.21. If you have a Kubernetes cluster spanning multiple vCenter servers, please consider moving all Kubernetes nodes to a single vCenter server; the vSphere CSI Driver does not support Kubernetes deployments that span multiple vCenter servers.
Support for these deprecated configurations will remain available until Kubernetes v1.24. (#98546, @divyenpatel)
API Change
PodAffinityTerm includes a namespaceSelector field to allow selecting eligible namespaces based on their labels.
A new CrossNamespacePodAffinity quota scope API allows restricting which namespaces are allowed to use PodAffinityTerm with cross-namespace references via the namespaceSelector or namespaces fields. (#98582, @ahg-g) [SIG API Machinery, Apps, Auth and Testing]
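As an illustrative sketch (the labels and namespaces are hypothetical, and in v1.21 this is expected to sit behind the alpha PodAffinityNamespaceSelector feature gate), a Pod can require co-location with pods in namespaces selected by label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: kubernetes.io/hostname
        # Match pods in any namespace carrying this label,
        # not only the Pod's own namespace.
        namespaceSelector:
          matchLabels:
            team: storefront   # hypothetical namespace label
  containers:
  - name: app
    image: nginx
```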
Add Probe-level terminationGracePeriodSeconds field (#99375, @ehashman) [SIG API Machinery, Apps, Node and Testing]
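A sketch of how the probe-level field can override the pod-level grace period (values are illustrative, and this assumes the corresponding alpha feature gate, ProbeTerminationGracePeriod, is enabled):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  terminationGracePeriodSeconds: 3600   # pod-level default
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      periodSeconds: 10
      failureThreshold: 3
      # Kills triggered by a failed liveness probe use this shorter
      # grace period instead of the pod-level 3600 seconds.
      terminationGracePeriodSeconds: 60
```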
Added .spec.completionMode field to Job, with accepted values NonIndexed (default) and Indexed. This is an alpha field and is only honored by servers with the IndexedJob feature gate enabled. (#98441, @alculquicondor) [SIG Apps and CLI]
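A minimal Indexed Job sketch (names and images are illustrative; as noted above, this requires the IndexedJob feature gate in v1.21):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo
spec:
  completions: 5
  parallelism: 2
  completionMode: Indexed   # each Pod gets an index from 0 to completions-1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        # The completion index is exposed to the container,
        # e.g. via the JOB_COMPLETION_INDEX environment variable.
        command: ["sh", "-c", "echo processing index $JOB_COMPLETION_INDEX"]
```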
Adds support for endPort field in NetworkPolicy (#97058, @rikatz) [SIG Apps and Network]
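A sketch of targeting a port range with endPort (selectors are hypothetical; in v1.21 this is alpha and also depends on the installed network plugin supporting it):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-port-range
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: db
    ports:
    - protocol: TCP
      port: 32000
      endPort: 32768   # allows the whole range 32000-32768
```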
CSIServiceAccountToken graduates to Beta and is enabled by default. (#99298, @zshihang)
Cluster admins can now turn off the /debug/pprof and /debug/flags/v endpoints in the kubelet by setting enableProfilingHandler and enableDebugFlagsHandler to false in the kubelet configuration file. The enableProfilingHandler and enableDebugFlagsHandler options can be set to true only when enableDebuggingHandlers is also set to true. (#98458, @SaranBalaji90)
DaemonSets accept a MaxSurge integer or percent on their rolling update strategy that will launch the updated pod on nodes and wait for those pods to go ready before marking the old out-of-date pods as deleted. This allows workloads to avoid downtime during upgrades when deployed using DaemonSets. This feature is alpha and is behind the DaemonSetUpdateSurge feature gate. (#96441, @smarterclayton) [SIG Apps and Testing]
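A sketch of a surging DaemonSet rollout (names and the image are hypothetical; this assumes the alpha DaemonSetUpdateSurge feature gate is on, and maxUnavailable must be 0 when surging):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # start the new pod before removing the old one
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: registry.example.com/agent:v2   # hypothetical image
```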
Enable SPDY pings to keep connections alive, so that kubectl exec and kubectl portforward won't be interrupted. (#97083, @knight42) [SIG API Machinery and CLI]
FieldManager no longer owns fields that get reset before the object is persisted (e.g. "status wiping"). (#99661, @kevindelgado) [SIG API Machinery, Auth and Testing]
Generic ephemeral volumes are beta. (#99643, @pohly) [SIG API Machinery, Apps, Auth, CLI, Node, Storage and Testing]
Hugepages request values are limited to integer multiples of the page size. (#98515, @lala123912) [SIG Apps]
Implement the GetAvailableResources in the podresources API. (#95734, @fromanirh) [SIG Instrumentation, Node and Testing]
IngressClass resource can now reference a resource in a specific namespace
for implementation-specific configuration (previously only Cluster-level resources were allowed).
This feature can be enabled using the IngressClassNamespacedParams feature gate. (#99275, @hbagdi)
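Assuming the IngressClassNamespacedParams feature gate is enabled, an IngressClass referencing a namespaced parameters object might look like this sketch (the controller name, parameter kind, and namespace are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: external-lb
spec:
  controller: example.com/ingress-controller
  parameters:
    apiGroup: k8s.example.com
    kind: IngressParameters
    name: external-config
    scope: Namespace         # previously only cluster-scoped references worked
    namespace: ingress-config
```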
Jobs API has a new .spec.suspend field that can be used to suspend and resume Jobs. This is an alpha field which is only honored by servers with the SuspendJob feature gate enabled. (#98727, @adtac)
The Kubelet Graceful Node Shutdown feature graduates to Beta and is enabled by default. (#99735, @bobbypage)
Kubernetes is now built using go1.15.7 (#98363, @cpanato) [SIG Cloud Provider, Instrumentation, Node, Release and Testing]
Namespace API objects now have a kubernetes.io/metadata.name label matching their metadata.name field to allow selecting any namespace by its name using a label selector. (#96968, @jayunit100) [SIG API Machinery, Apps, Cloud Provider, Storage and Testing]
A new field, "InternalTrafficPolicy", is added to Service.
It specifies whether cluster-internal traffic should be routed to all endpoints or to node-local endpoints only.
"Cluster" routes internal traffic to a Service to all endpoints.
"Local" routes traffic to node-local endpoints only, and traffic is dropped if no node-local endpoints are ready.
The default value is "Cluster". (#96600, @maplain) [SIG API Machinery, Apps and Network]
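A sketch of a Service opting into node-local routing (the selector and port are illustrative; in v1.21 this is expected to require the alpha ServiceInternalTrafficPolicy feature gate):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: node-local-dns
spec:
  selector:
    app: dns-cache
  ports:
  - port: 53
    protocol: UDP
  # Traffic from pods on a node is delivered only to endpoints on that
  # same node; it is dropped if no local endpoint is ready.
  internalTrafficPolicy: Local
```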
PodDisruptionBudget API objects can now contain conditions in status. (#98127, @mortent) [SIG API Machinery, Apps, Auth, CLI, Cloud Provider, Cluster Lifecycle and Instrumentation]
PodSecurityPolicy only stores "generic" as allowed volume type if the GenericEphemeralVolume feature gate is enabled (#98918, @pohly) [SIG Auth and Security]
Promote CronJobs to batch/v1 (#99423, @soltysh) [SIG API Machinery, Apps, CLI and Testing]
Promote the Immutable Secrets/ConfigMaps feature to Stable. This allows setting the immutable field in a Secret or ConfigMap object to mark its contents as immutable. (#97615, @wojtek-t) [SIG Apps, Architecture, Node and Testing]
Remove support for building Kubernetes with bazel. (#99561, @BenTheElder) [SIG API Machinery, Apps, Architecture, Auth, Autoscaling, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation, Network, Node, Release, Scalability, Scheduling, Storage, Testing and Windows]
Scheduler extender filter interface now can report unresolvable failed nodes in the new field FailedAndUnresolvableNodes of ExtenderFilterResult struct. Nodes in this map will be skipped in the preemption phase. (#92866, @cofyc) [SIG Scheduling]
Services can specify loadBalancerClass to use a custom load balancer (#98277, @XudongLiuHarold)
Storage capacity tracking (the CSIStorageCapacity feature) graduates to Beta and is enabled by default; storage.k8s.io/v1alpha1 VolumeAttachment and CSIStorageCapacity objects are deprecated (#99641, @pohly)
Support for Indexed Job: a Job that is considered completed when Pods associated to indexes from 0 to (.spec.completions-1) have succeeded. (#98812, @alculquicondor) [SIG Apps and CLI]
The BoundServiceAccountTokenVolume feature has been promoted to beta, and enabled by default.
This changes the tokens provided to containers at /var/run/secrets/kubernetes.io/serviceaccount/token to be time-limited, auto-refreshed, and invalidated when the containing pod is deleted.
Clients should reload the token from disk periodically (once per minute is recommended) to ensure they continue to use a valid token. k8s.io/client-go version v11.0.0+ and v0.15.0+ reload tokens automatically.
By default, injected tokens are given an extended lifetime so they remain valid even after a new refreshed token is provided. The metric serviceaccount_stale_tokens_total can be used to monitor for workloads that are depending on the extended lifetime and are continuing to use tokens even after a refreshed token is provided to the container. If that metric indicates no existing workloads are depending on extended lifetimes, injected token lifetime can be shortened to 1 hour by starting kube-apiserver with --service-account-extend-token-expiration=false. (#95667, @zshihang) [SIG API Machinery, Auth, Cluster Lifecycle and Testing]
The EndpointSlice Controllers are now GA. The EndpointSliceController will not populate the deprecatedTopology field and will only provide topology information through the zone and nodeName fields. (#99870, @swetharepakula)
The Endpoints controller will now set the endpoints.kubernetes.io/over-capacity annotation to "warning" when an Endpoints resource contains more than 1000 addresses. In a future release, the controller will truncate Endpoints that exceed this limit. The EndpointSlice API can be used to support significantly larger number of addresses. (#99975, @robscott) [SIG Apps and Network]
The PodDisruptionBudget API has been promoted to policy/v1 with no schema changes. The only functional change is that an empty selector ({}) written to a policy/v1 PodDisruptionBudget now selects all pods in the namespace. The behavior of the policy/v1beta1 API remains unchanged. The policy/v1beta1 PodDisruptionBudget API is deprecated and will no longer be served in 1.25+. (#99290, @mortent) [SIG API Machinery, Apps, Auth, Autoscaling, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation, Scheduling and Testing]
The EndpointSlice API is now GA. The EndpointSlice topology field has been removed from the GA API and will be replaced by a new per Endpoint Zone field. If the topology field was previously used, it will be converted into an annotation in the v1 Resource. The discovery.k8s.io/v1alpha1 API is removed. (#99662, @swetharepakula)
The controller.kubernetes.io/pod-deletion-cost annotation can be set to offer a hint on the cost of deleting a Pod compared to other pods belonging to the same ReplicaSet. Pods with lower deletion cost are deleted first. This is an alpha feature. (#99163, @ahg-g)
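A sketch of the annotation on a ReplicaSet-owned Pod (names are illustrative; this assumes the alpha PodDeletionCost feature gate is enabled, and in practice the annotation is usually patched onto running Pods rather than set at creation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: replica-b
  annotations:
    # Lower values are deleted first when the ReplicaSet scales down.
    controller.kubernetes.io/pod-deletion-cost: "-100"
spec:
  containers:
  - name: app
    image: nginx
```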
The kube-apiserver now resets managedFields that got corrupted by a mutating admission controller. (#98074, @kwiesmueller)
Topology Aware Hints are now available in alpha and can be enabled with the TopologyAwareHints feature gate. (#99522, @robscott) [SIG API Machinery, Apps, Auth, Instrumentation, Network and Testing]
Users may specify the kubectl.kubernetes.io/default-exec-container annotation in a Pod to preselect a container for kubectl commands. (#97099, @pacoxu) [SIG CLI]
Feature
A client-go metric, rest_client_exec_plugin_call_total, has been added to track total calls to client-go credential plugins. (#98892, @ankeesler) [SIG API Machinery, Auth, Cluster Lifecycle and Instrumentation]
A new histogram metric to track the time it took to delete a job by the TTLAfterFinished controller (#98676, @ahg-g)
AWS cloud provider supports auto-discovering subnets without any kubernetes.io/cluster/<clusterName> tags. It also supports additional service annotation service.beta.kubernetes.io/aws-load-balancer-subnets to manually configure the subnets. (#97431, @kishorj)
Add --permit-address-sharing flag to kube-apiserver to listen with SO_REUSEADDR. While allowing to listen on wildcard IPs like 0.0.0.0 and specific IPs in parallel, it avoids waiting for the kernel to release socket in TIME_WAIT state, and hence, considerably reducing kube-apiserver restart times under certain conditions. (#93861, @sttts)
Add csi_operations_seconds metric on kubelet that exposes CSI operations duration and status for node CSI operations. (#98979, @Jiawei0227) [SIG Instrumentation and Storage]
Add migrated field into storage_operation_duration_seconds metric (#99050, @Jiawei0227) [SIG Apps, Instrumentation and Storage]
Add flag --lease-reuse-duration-seconds for kube-apiserver to config etcd lease reuse duration. (#97009, @lingsamuel) [SIG API Machinery and Scalability]
Add metric etcd_lease_object_counts for kube-apiserver to observe max objects attached to a single etcd lease. (#97480, @lingsamuel) [SIG API Machinery, Instrumentation and Scalability]
Add support to generate client-side binaries for new darwin/arm64 platform (#97743, @dims) [SIG Release and Testing]
Added ephemeral_volume_controller_create[_failures]_total counters to kube-controller-manager metrics (#99115, @pohly) [SIG API Machinery, Apps, Cluster Lifecycle, Instrumentation and Storage]
Added support for installing arm64 node artifacts. (#99242, @liu-cong)
Adds alpha feature VolumeCapacityPriority which makes the scheduler prioritize nodes based on the best matching size of statically provisioned PVs across multiple topologies. (#96347, @cofyc) [SIG Apps, Network, Scheduling, Storage and Testing]
Adds the ability to pass --strict-transport-security-directives to the kube-apiserver to set the HSTS header appropriately. Be sure you understand the consequences to browsers before setting this field. (#96502, @249043822) [SIG Auth]
Adds two new metrics to cronjobs, a histogram to track the time difference when a job is created and the expected time when it should be created, as well as a gauge for the missed schedules of a cronjob (#99341, @alaypatel07)
Alpha implementation of kubectl Command Headers (SIG CLI KEP 859): enabled when the KUBECTL_COMMAND_HEADERS environment variable is set on the client command line. (#98952, @seans3)
Base-images: Update to debian-iptables:buster-v1.4.0
CRIContainerLogRotation graduates to GA and is unconditionally enabled. (#99651, @umohnani8)
Component owner can configure the allowlist of metric label with flag '--allow-metric-labels'. (#99385, @YoyinZyc) [SIG API Machinery, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation and Release]
Component owner can configure the allowlist of metric label with flag '--allow-metric-labels'. (#99738, @YoyinZyc) [SIG API Machinery, Cluster Lifecycle and Instrumentation]
EmptyDir memory-backed volumes are sized as the minimum of pod allocatable memory on a host and an optional explicit user-provided value. (#100319, @derekwaynecarr) [SIG Node]
Enables Kubelet to check volume condition and log events to corresponding pods. (#99284, @fengzixu) [SIG Apps, Instrumentation, Node and Storage]
EndpointSliceNodeName graduates to GA and thus will be unconditionally enabled -- NodeName will always be available in the v1beta1 API. (#99746, @swetharepakula)
Export NewDebuggingRoundTripper function and DebugLevel options in the k8s.io/client-go/transport package. (#98324, @atosatto)
Kube-proxy iptables: new metric sync_proxy_rules_iptables_total that exposes the number of rules programmed per table in each iteration (#99653, @aojea) [SIG Instrumentation and Network]
Kube-scheduler now logs plugin scoring summaries at --v=4 (#99411, @damemi) [SIG Scheduling]
Kubeadm now includes CoreDNS v1.8.0. (#96429, @rajansandeep) [SIG Cluster Lifecycle]
Kubeadm: the IPv6DualStack feature gate graduates to Beta and is enabled by default (#99294, @pacoxu)
Kubeadm: print a warning to the user, as IPv6 site-local addresses are deprecated (#99574, @pacoxu) [SIG Cluster Lifecycle and Network]
Kubeadm: add support for certificate chain validation. When using kubeadm in external CA mode, this allows an intermediate CA to be used to sign the certificates. The intermediate CA certificate must be appended to each signed certificate for this to work correctly. (#97266, @robbiemcmichael) [SIG Cluster Lifecycle]
Kubeadm: amend the node kernel validation to treat CGROUP_PIDS, FAIR_GROUP_SCHED as required and CFS_BANDWIDTH, CGROUP_HUGETLB as optional (#96378, @neolit123) [SIG Cluster Lifecycle and Node]
Kubeadm: apply the "node.kubernetes.io/exclude-from-external-load-balancers" label on control plane nodes during "init", "join" and "upgrade" to preserve backwards compatibility with the legacy LB mode where nodes labeled as "master" were excluded. To opt out, you can remove the label from a node. See #97543 and the linked KEP for more details. (#98269, @neolit123) [SIG Cluster Lifecycle]
Kubeadm: if the user has customized their image repository via the kubeadm configuration, pass the custom pause image repository and tag to the kubelet via --pod-infra-container-image not only for Docker but for all container runtimes. This flag tells the kubelet that it should not garbage collect the image. (#99476, @neolit123) [SIG Cluster Lifecycle]
Kubeadm: perform pre-flight validation on host/node name upon kubeadm init and kubeadm join, showing warnings on non-compliant names (#99194, @pacoxu)
Kubectl version changed to write a warning message to stderr if the client and server version difference exceeds the supported version skew of +/-1 minor version. (#98250, @brianpursley) [SIG CLI]
Kubectl: Add --use-protocol-buffers flag to kubectl top pods and nodes. (#96655, @serathius)
Kubectl: kubectl get will omit managed fields by default now. Users could set --show-managed-fields to true to show managedFields when the output format is either json or yaml. (#96878, @knight42) [SIG CLI and Testing]
Kubectl: a container in a Pod can be preselected as the default container using the kubectl.kubernetes.io/default-container annotation (#99833, @mengjiao-liu)
Kubectl: add bash-completion for comma separated list on kubectl get (#98301, @phil9909)
Kubernetes is now built using go1.15.8 (#98834, @cpanato) [SIG Cloud Provider, Instrumentation, Release and Testing]
Kubernetes is now built with Golang 1.16 (#98572, @justaugustus) [SIG API Machinery, Auth, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation, Node, Release and Testing]
Kubernetes is now built with Golang 1.16.1 (#100106, @justaugustus) [SIG Cloud Provider, Instrumentation, Release and Testing]
Metrics can now be disabled explicitly via a command line flag (i.e. '--disabled-metrics=metric1,metric2') (#99217, @logicalhan)
New admission controller DenyServiceExternalIPs is available. Clusters which do not need the Service externalIPs feature should enable this controller and be more secure. (#97395, @thockin)
Enabling the PreferNominatedNode feature improves scheduling performance where preemption happens frequently; in theory, though, with the feature enabled a pod might not be scheduled to the best candidate node in the cluster. (#93179, @chendave) [SIG Scheduling and Testing]
Persistent Volumes formatted with the btrfs filesystem will now automatically resize when expanded. (#99361, @Novex) [SIG Storage]
Port the devicemanager to Windows node to allow device plugins like directx (#93285, @aarnaud) [SIG Node, Testing and Windows]
Removes cAdvisor JSON metrics (/stats/container, /stats/<podname>/<containername>, /stats/<namespace>/<podname>/<poduid>/<containername>) from the kubelet. (#99236, @pacoxu)
Sysctls graduates to General Availability and is thus unconditionally enabled. (#99158, @wgahnagl)
The Kubernetes pause image manifest list now contains an image for Windows Server 20H2. (#97322, @claudiubelu) [SIG Windows]
The NodeAffinity plugin implements the PreFilter extension, offering enhanced performance for Filter. (#99213, @AliceZhang2016) [SIG Scheduling]
The CronJobControllerV2 feature flag graduates to Beta and is enabled by default. (#98878, @soltysh)
The EndpointSlice mirroring controller mirrors the annotations and labels of Endpoints resources to the generated EndpointSlices, and also ensures that updates to any of these fields are mirrored.
The well-known annotation endpoints.kubernetes.io/last-change-trigger-time is skipped and not mirrored. (#98116, @aojea)
The RunAsGroup feature has been promoted to GA in this release. (#94641, @krmayankk) [SIG Auth and Node]
The ServiceAccountIssuerDiscovery feature has graduated to GA, and is unconditionally enabled. The ServiceAccountIssuerDiscovery feature-gate will be removed in 1.22. (#98553, @mtaufen) [SIG API Machinery, Auth and Testing]
The TTLAfterFinished feature flag is now beta and enabled by default (#98678, @ahg-g)
The apimachinery util/net function used to detect the bind address, ResolveBindAddress(), now takes into consideration global IP addresses on loopback interfaces when 1) the host has default routes, or 2) there are no global IPs on those interfaces, in order to support more complex network scenarios like BGP unnumbered (RFC 5549) (#95790, @aojea) [SIG Network]
The feature gate RootCAConfigMap graduated to GA in v1.21 and therefore will be unconditionally enabled. This flag will be removed in v1.22 release. (#98033, @zshihang)
The pause image upgraded to v3.4.1 in kubelet and kubeadm for both Linux and Windows. (#98205, @pacoxu)
Update pause container to run as pseudo user and group 65535:65535. This implies the release of version 3.5 of the container images. (#97963, @saschagrunert) [SIG CLI, Cloud Provider, Cluster Lifecycle, Node, Release, Security and Testing]
Update the latest validated version of Docker to 20.10 (#98977, @neolit123) [SIG CLI, Cluster Lifecycle and Node]
Upgrade node local dns to 1.17.0 for better IPv6 support (#99749, @pacoxu) [SIG Cloud Provider and Network]
Upgrades IPv6DualStack to Beta and turns it on by default. New or existing clusters are not affected until an actor starts adding secondary Pod and Service CIDR CLI flags as described here: IPv4/IPv6 Dual-stack (#98969, @khenidak)
Users can specify the kubectl.kubernetes.io/default-container annotation in a Pod to preselect a container for kubectl commands. (#99581, @mengjiao-liu) [SIG CLI]
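As an illustrative sketch (pod, container, and image names are invented), a multi-container Pod can preselect the container that kubectl commands target by default:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  annotations:
    # container that kubectl logs/exec target when -c is omitted
    kubectl.kubernetes.io/default-container: app
spec:
  containers:
  - name: app
    image: nginx
  - name: sidecar
    image: busybox
    command: ["sleep", "3600"]
```

With this in place, `kubectl logs web` would read logs from the app container without requiring `-c app`.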
When downscaling ReplicaSets, ready and creation timestamps are compared in a logarithmic scale. (#99212, @damemi) [SIG Apps and Testing]
When the kubelet is watching a ConfigMap or Secret purely in the context of setting environment variables
for containers, only hold that watch for a defined duration before cancelling it. This change reduces the CPU
and memory usage of the kube-apiserver in large clusters. (#99393, @chenyw1990) [SIG API Machinery, Node and Testing]
WindowsEndpointSliceProxying feature gate has graduated to beta and is enabled by default. This means kube-proxy will read from EndpointSlices instead of Endpoints on Windows by default. (#99794, @robscott) [SIG Network]
kubectl wait ensures that observedGeneration >= generation to prevent stale state reporting. An example scenario can be found on CRD updates. (#97408, @KnicKnic)
Documentation
Azure file migration graduates to beta, with the CSIMigrationAzureFile flag off by default
as it requires installation of the AzureFile CSI Driver. Users should enable the CSIMigration and
CSIMigrationAzureFile features and install the AzureFile CSI Driver
to avoid disruption to existing Pod and PVC objects at that time. The Azure File CSI driver does not support using the same persistent
volume with different fsgroups; when CSI migration is enabled for the azurefile driver, such a case is not supported.
(There is one supported case where the volume is mounted with 0777 and is then readable/writable by everyone.) (#96293, @andyzhangx)
Official support to build kubernetes with docker-machine / remote docker is removed. This change does not affect building kubernetes with docker locally. (#97935, @adeniyistephen) [SIG Release and Testing]
Set kubelet option --volume-stats-agg-period to negative value to disable volume calculations. (#96675, @pacoxu) [SIG Node]
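A hedged sketch of the same setting via the corresponding KubeletConfiguration field (field name per the kubelet.config.k8s.io/v1beta1 API); the negative duration disables volume stats calculations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# negative value disables volume metrics aggregation entirely
volumeStatsAggPeriod: -1s
```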
Failing Test
Escape special characters such as [ and ] that can exist in vSphere Windows paths (#98830, @liyanhui1228) [SIG Storage and Windows]
Kube-proxy: fix a bug on UDP NodePort Services where stale connection tracking entries may blackhole the traffic directed to the NodePort (#98305, @aojea)
Kubelet: fixes a bug in the HostPort dockershim implementation that caused the conformance test "HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol" to fail. (#98755, @aojea) [SIG Cloud Provider, Network and Node]
Bug or Regression
AcceleratorStats will be available in the Summary API of kubelet when cri_stats_provider is used. (#96873, @ruiwen-zhao) [SIG Node]
When a failure is detected during creation of the volume data file on a CSI volume, all data is no longer automatically deleted; now only the data file and volume path are removed. (#96021, @huffmanca)
Clean up ReplicaSets by revision instead of creation timestamp in the deployment controller (#97407, @waynepeking348) [SIG Apps]
Cleanup subnet in frontend IP configs to prevent huge subnet request bodies in some scenarios. (#98133, @nilo19) [SIG Cloud Provider]
Client-go exec credential plugins will pass stdin only when interactive terminal is detected on stdin. This fixes a bug where previously it was checking if stdout is an interactive terminal. (#99654, @ankeesler)
Cloud-controller-manager: routes controller should not depend on --allocate-node-cidrs (#97029, @andrewsykim) [SIG Cloud Provider and Testing]
Cluster Autoscaler version bump to v1.20.0 (#97011, @towca)
Creating a PVC with DataSource should fail for non-CSI plugins. (#97086, @xing-yang) [SIG Apps and Storage]
EndpointSlice controller is now less likely to emit FailedToUpdateEndpointSlices events. (#99345, @robscott) [SIG Apps and Network]
EndpointSlice controllers are less likely to create duplicate EndpointSlices. (#100103, @robscott) [SIG Apps and Network]
EndpointSliceMirroring controller is now less likely to emit FailedToUpdateEndpointSlices events. (#99756, @robscott) [SIG Apps and Network]
Ensure all vSphere nodes are tracked by the volume attach-detach controller (#96689, @gnufied)
Ensure empty string annotations are copied over in rollbacks. (#94858, @waynepeking348)
Ensure only one LoadBalancer rule is created when HA mode is enabled (#99825, @feiskyer) [SIG Cloud Provider]
Ensure that client-go's EventBroadcaster is safe (non-racy) during shutdown. (#95664, @DirectXMan12) [SIG API Machinery]
Explicitly pass KUBE_BUILD_CONFORMANCE=y in package-tarballs to reenable building the conformance tarballs. (#100571, @puerco)
Fix Azure file migration e2e test failure when CSIMigration is turned on. (#97877, @andyzhangx)
Fix CSI-migrated inline EBS volumes failing to mount if their volumeID is prefixed by aws:// (#96821, @wongma7) [SIG Storage]
Fix CVE-2020-8555 for Gluster client connections. (#97922, @liggitt) [SIG Storage]
Fix NPE in ephemeral storage eviction (#98261, @wzshiming) [SIG Node]
Fix PermissionDenied issue on SMB mount for Windows (#99550, @andyzhangx)
Fix bug that would let the Horizontal Pod Autoscaler scale down despite at least one metric being unavailable/invalid (#99514, @mikkeloscar) [SIG Apps and Autoscaling]
Fix cgroup handling for systemd with cgroup v2 (#98365, @odinuge) [SIG Node]
Fix counting error in service/nodeport/loadbalancer quota check (#97451, @pacoxu) [SIG API Machinery, Network and Testing]
Fix errors when accessing Windows container stats for Dockershim (#98510, @jsturtevant) [SIG Node and Windows]
Fix kube-proxy container image architecture for non-amd64 images. (#98526, @saschagrunert)
Fix nil VMSS name when setting service to auto mode (#97366, @nilo19) [SIG Cloud Provider]
Fix privileged config of Pod Sandbox which was previously ignored. (#96877, @xeniumlee)
Fix the panic when kubelet registers if a node object already exists with no Status.Capacity or Status.Allocatable (#95269, @SataQiu) [SIG Node]
Fix a regression that slowed pod termination; before this fix, pods could take up to one additional minute to terminate. This reverses the change that ensured CNI resources were cleaned up when the pod was removed on the API server. (#97980, @SergeyKanzhelev) [SIG Node]
Fix to recover CSI volumes from certain dangling attachments (#96617, @yuga711) [SIG Apps and Storage]
Fix: azure file latency issue for metadata-heavy workloads (#97082, @andyzhangx) [SIG Cloud Provider and Storage]
Fixed FibreChannel volume plugin corrupting filesystems on detach of multipath volumes. (#97013, @jsafrane) [SIG Storage]
Fixed a bug in kubelet that will saturate CPU utilization after containerd got restarted. (#97174, @hanlins) [SIG Node]
Fixed a bug that caused a smaller conntrack-max value to be used under the CPU static policy. (#99225, @xh4n3) (#99613, @xh4n3) [SIG Network]
Fixed a bug on Kubernetes nodes where healthcheck nodeports would not work when the policy of the INPUT chain in the filter table is not ACCEPT.
Added iptables rules to allow healthcheck nodeport traffic. (#97824, @hanlins) [SIG Network]
Fixed a bug that prevented the kubelet from starting on Btrfs. (#98042, @gjkim42) [SIG Node]
Fixed a race condition on API server startup ensuring previously created webhook configurations are effective before the first write request is admitted. (#95783, @roycaihw) [SIG API Machinery]
Fixed an issue with garbage collection failing to clean up namespaced children of an object also referenced incorrectly by cluster-scoped children (#98068, @liggitt) [SIG API Machinery and Apps]
Fixed authentication_duration_seconds metric scope. Previously, it included whole apiserver request duration which yields inaccurate results. (#99944, @marseel)
Fixed bug in CPUManager with race on container map access (#97427, @klueska) [SIG Node]
Fixed bug that caused cAdvisor to incorrectly detect single-socket multi-NUMA topology. (#99315, @iwankgb) [SIG Node]
Fixed cleanup of block devices when /var/lib/kubelet is a symlink. (#96889, @jsafrane) [SIG Storage]
Fixed the namespace flag having no effect when exposing a deployment with --dry-run=client. (#97492, @masap) [SIG CLI]
Fixed provisioning of Cinder volumes migrated to CSI when StorageClass with AllowedTopologies was used. (#98311, @jsafrane) [SIG Storage]
Fixes a bug in identifying the correct containerd process. (#97888, @pacoxu)
Fixes add-on manager leader election to use leases instead of endpoints, similar to what kube-controller-manager does in 1.20 (#98968, @liggitt)
Fixes connection errors when using --volume-host-cidr-denylist or --volume-host-allow-local-loopback (#98436, @liggitt) [SIG Network and Storage]
Fixes problem where invalid selector on PodDisruptionBudget leads to a nil pointer dereference that causes the Controller manager to crash loop. (#98750, @mortent)
Fixes spurious errors about IPv6 in kube-proxy logs on nodes with IPv6 disabled. (#99127, @danwinship)
Fixing a bug where a failed node may not have the NoExecute taint set correctly (#96876, @howieyuen) [SIG Apps and Node]
GCE Internal LoadBalancer sync loop will now release the ILB IP address upon sync failure. An error in ILB forwarding rule creation will no longer leak IP addresses. (#97740, @prameshj) [SIG Cloud Provider and Network]
Ignore update pod with no new images in alwaysPullImages admission controller (#96668, @pacoxu) [SIG Apps, Auth and Node]
Improve speed of vSphere PV provisioning and reduce number of API calls (#100054, @gnufied) [SIG Cloud Provider and Storage]
KUBECTL_EXTERNAL_DIFF now accepts equal sign for additional parameters. (#98158, @dougsland) [SIG CLI]
Kube-apiserver: an update of a pod with a generic ephemeral volume dropped that volume if the feature had been disabled since creating the pod with such a volume (#99446, @pohly) [SIG Apps, Node and Storage]
Kube-proxy: remove deprecated --cleanup-ipvs flag of kube-proxy, and make --cleanup flag always to flush IPVS (#97336, @maaoBit) [SIG Network]
Kubeadm installs etcd v3.4.13 when creating cluster v1.19 (#97244, @pacoxu)
Kubeadm: Fixes a kubeadm upgrade bug that could cause a custom CoreDNS configuration to be replaced with the default. (#97016, @rajansandeep) [SIG Cluster Lifecycle]
Kubeadm: Some text in the kubeadm upgrade plan output has changed. If you have scripts or other automation that parses this output, please review these changes and update your scripts to account for the new output. (#98728, @stmcginnis) [SIG Cluster Lifecycle]
Kubeadm: fix a bug in the host memory detection code on 32bit Linux platforms (#97403, @abelbarrera15) [SIG Cluster Lifecycle]
Kubeadm: fix a bug where "kubeadm join" would not properly handle missing names for existing etcd members. (#97372, @ihgann) [SIG Cluster Lifecycle]
Kubeadm: fix a bug where "kubeadm upgrade" commands can fail if CoreDNS v1.8.0 is installed. (#97919, @neolit123) [SIG Cluster Lifecycle]
Kubeadm: fix a bug where external credentials in an existing admin.conf prevented the CA certificate from being written to the cluster-info ConfigMap. (#98882, @kvaps) [SIG Cluster Lifecycle]
Kubeadm: get k8s CI version markers from k8s infra bucket (#98836, @hasheddan) [SIG Cluster Lifecycle and Release]
Kubeadm: skip validating pod subnet against node-cidr-mask when allocate-node-cidrs is set to be false (#98984, @SataQiu) [SIG Cluster Lifecycle]
Kubectl logs: --ignore-errors is now honored by all containers, maintaining consistency with parallelConsumeRequest behavior. (#97686, @wzshiming)
Kubectl-convert: Fix the `no kind "Ingress" is registered for version` error (#97754, @wzshiming)
Kubectl: Fixed panic when describing an ingress backend without an API Group (#100505, @lauchokyip) [SIG CLI]
Kubelet now cleans up orphaned volume directories automatically (#95301, @lorenz) [SIG Node and Storage]
Kubelet.exe on Windows now checks that the process is running as administrator and that the executing user account is listed in the built-in administrators group. This is equivalent to checking that the process is running as uid 0. (#96616, @perithompson) [SIG Node and Windows]
Kubelet: Fix kubelet from panic after getting the wrong signal (#98200, @wzshiming) [SIG Node]
Kubelet: Fixed a bug in getting the CPU count on Windows when the number of logical processors is more than 64 (#97378, @hwdef) [SIG Node and Windows]
Leases are now limited to a maximum of 1000 attached objects. (#98257, @lingsamuel)
Mitigate CVE-2020-8555 for kube-up using GCE by preventing local loopback volume hosts. (#97934, @mattcary) [SIG Cloud Provider and Storage]
On single-stack configured (IPv4 or IPv6, but not both) clusters, Services which are both headless (no clusterIP) and selectorless (empty or undefined selector) will report ipFamilyPolicy RequireDualStack and will have entries in ipFamilies[] for both IPv4 and IPv6. This is a change from alpha, but does not have any impact on the manually-specified Endpoints and EndpointSlices for the Service. (#99555, @thockin) [SIG Apps and Network]
Performance regression #97685 has been fixed. (#97860, @MikeSpreitzer) [SIG API Machinery]
Pod log stats for Windows now report metrics (#99221, @jsturtevant) [SIG Node, Storage, Testing and Windows]
Pod status now updates faster when reacting to probe results. The first readiness probe is called sooner once startup probes succeed, which marks the Pod ready faster. (#98376, @matthyx)
Readjust kubelet_containers_per_pod_count buckets to only show metrics greater than 1. (#98169, @wawa0210)
Remove CSI topology from migrated in-tree gcepd volume. (#97823, @Jiawei0227) [SIG Cloud Provider and Storage]
Requests with invalid timeout parameters in the request URL now appear in the audit log correctly. (#96901, @tkashem) [SIG API Machinery and Testing]
Resolve a "concurrent map read and map write" crashing error in the kubelet (#95111, @choury) [SIG Node]
Resolves spurious Failed to list *v1.Secret or Failed to list *v1.ConfigMap messages in kubelet logs. (#99538, @liggitt) [SIG Auth and Node]
ResourceQuota for an entity now inclusively calculates Pod overhead (#99600, @gjkim42)
Return zero time (midnight on Jan. 1, 1970) instead of a negative number when reporting startedAt and finishedAt of a not-yet-started or running Pod when using dockershim as a runtime. (#99585, @Iceber)
Reverts breaking change to inline AzureFile volumes; referenced secrets are now searched for in the same namespace as the pod as in previous releases. (#100563, @msau42)
Scores from InterPodAffinity have stronger differentiation. (#98096, @leileiwan) [SIG Scheduling]
Specifying the KUBE_TEST_REPO environment variable when e2e tests are executed will instruct the test infrastructure to load that image from a location within the specified repo, using a predefined pattern. (#93510, @smarterclayton) [SIG Testing]
Static pods will be deleted gracefully. (#98103, @gjkim42) [SIG Node]
Sync node status during kubelet node shutdown.
Adds a pod admission handler that rejects new pods when the node is in the process of shutting down. (#98005, @wzshiming) [SIG Node]
The calculation of pod UIDs for static pods has changed to ensure each static pod gets a unique value - this will cause all static pod containers to be recreated/restarted if an in-place kubelet upgrade from 1.20 to 1.21 is performed. Note that draining pods before upgrading the kubelet across minor versions is the supported upgrade path. (#87461, @bboreham) [SIG Node]
The maximum number of ports allowed in EndpointSlices has been increased from 100 to 20,000 (#99795, @robscott) [SIG Network]
Truncates a message if it hits the NoteLengthLimit when the scheduler records an event for the pod that indicates the pod has failed to schedule. (#98715, @carlory)
Updated k8s.gcr.io/ingress-gce-404-server-with-metrics-amd64 to a version that serves /metrics endpoint on a non-default port. (#97621, @vbannai) [SIG Cloud Provider]
Updates the commands `kubectl kustomize {arg}` and `kubectl apply -k {arg}` to use the same code as kustomize CLI v4.0.5 (#98946, @monopole)
Use force unmount for NFS volumes if regular mount fails after 1 minute timeout (#96844, @gnufied) [SIG Storage]
Use network.Interface.VirtualMachine.ID to get the bound VM.
Skip standalone VMs when reconciling the LoadBalancer (#97635, @nilo19) [SIG Cloud Provider]
Using exec auth plugins with kubectl no longer results in warnings about constructing many client instances from the same exec auth config. (#97857, @liggitt) [SIG API Machinery and Auth]
When a CNI plugin returns dual-stack pod IPs, kubelet will now try to respect the
"primary IP family" of the cluster by picking a primary pod IP of the same family
as the (primary) node IP, rather than assuming that the CNI plugin returned the IPs
in the order the administrator wanted (since some CNI plugins don't allow
configuring this). (#97979, @danwinship) [SIG Network and Node]
When dynamically provisioning Azure File volumes for a premium account, the requested size will be set to 100GB if the request is initially lower than this value to accommodate Azure File requirements. (#99122, @huffmanca) [SIG Cloud Provider and Storage]
When using Containerd on Windows, the C:\Windows\System32\drivers\etc\hosts file will now be managed by kubelet. (#83730, @claudiubelu)
VolumeBindingArgs now allows BindTimeoutSeconds to be set to zero, where a value of zero indicates no waiting for the volume binding check. (#99835, @chendave) [SIG Scheduling and Storage]
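A hedged sketch of how this might be configured in a scheduler configuration file (structure per the kubescheduler.config.k8s.io/v1beta1 API; the profile shown is illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: VolumeBinding
    args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: VolumeBindingArgs
      # zero means: do not wait when checking the volume binding operation
      bindTimeoutSeconds: 0
```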
kubectl exec and kubectl attach now honor the --quiet flag which suppresses output from the local binary that could be confused by a script with the remote command output (all non-failure output is hidden). In addition, print inline with exec and attach the list of alternate containers when we default to the first spec.container. (#99004, @smarterclayton) [SIG CLI]
Other (Cleanup or Flake)
APIs for kubelet annotations and labels from k8s.io/kubernetes/pkg/kubelet/apis are now moved under k8s.io/kubelet/pkg/apis/ (#98931, @michaelbeaumont)
Apiserver_request_duration_seconds is promoted to stable status. (#99925, @logicalhan) [SIG API Machinery, Instrumentation and Testing]
Bump github.com/Azure/go-autorest/autorest to v0.11.12 (#97033, @patrickshan) [SIG API Machinery, CLI, Cloud Provider and Cluster Lifecycle]
Clients are required to use go1.15.8+ or go1.16+ if kube-apiserver has the goaway feature enabled, to avoid an unexpected data race condition. (#98809, @answer1991)
Delete the deprecated service.beta.kubernetes.io/azure-load-balancer-mixed-protocols mixed protocol annotation in favor of the MixedProtocolLBService feature (#97096, @nilo19) [SIG Cloud Provider]
EndpointSlice generation is now incremented when labels change. (#99750, @robscott) [SIG Network]
The AllowInsecureBackendProxy feature gate graduates to GA and is unconditionally enabled. (#99658, @deads2k)
Increase timeout for pod lifecycle test to reach pod status=ready (#96691, @hh)
Increased CSINodeIDMaxLength from 128 bytes to 192 bytes. (#98753, @Jiawei0227)
Kube-apiserver: The OIDC authenticator no longer waits 10 seconds before attempting to fetch the metadata required to verify tokens. (#97693, @enj) [SIG API Machinery and Auth]
Kube-proxy: Traffic from the cluster directed to ExternalIPs is always sent directly to the Service. (#96296, @aojea) [SIG Network and Testing]
Kubeadm: change the default image repository for CI images from 'gcr.io/kubernetes-ci-images' to 'gcr.io/k8s-staging-ci-images' (#97087, @SataQiu) [SIG Cluster Lifecycle]
Kubectl: The deprecated kubectl alpha debug command is removed. Use kubectl debug instead. (#98111, @pandaamanda) [SIG CLI]
Kubelet command line flags related to dockershim now show a deprecation message, as they will be removed along with dockershim in a future release. (#98730, @dims)
Official support to build kubernetes with docker-machine / remote docker is removed. This change does not affect building kubernetes with docker locally. (#97618, @jherrera123) [SIG Release and Testing]
Process start time on Windows now uses current process information (#97491, @jsturtevant) [SIG API Machinery, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation and Windows]
Resolves flakes in the Ingress conformance tests due to conflicts with controllers updating the Ingress object (#98430, @liggitt) [SIG Network and Testing]
The AttachVolumeLimit feature gate (GA since v1.17) has been removed and the feature is now unconditionally enabled. (#96539, @ialidzhikov)
The CSINodeInfo feature gate that is GA since v1.17 is unconditionally enabled, and can no longer be specified via the --feature-gates argument. (#96561, @ialidzhikov) [SIG Apps, Auth, Scheduling, Storage and Testing]
The apiserver_request_total metric is promoted to stable status and no longer has a content-type dimension, so any alerts/charts which presume its existence will fail. This is, however, unlikely to be the case since it was effectively an unbounded dimension in the first place. (#99788, @logicalhan)
The default delegating authorization options now allow unauthenticated access to healthz, readyz, and livez. A system:masters user connecting to an authz delegator will not perform an authz check. (#98325, @deads2k) [SIG API Machinery, Auth, Cloud Provider and Scheduling]
The deprecated feature gates CSIDriverRegistry, BlockVolume and CSIBlockVolume are now unconditionally enabled and can no longer be specified in component invocations. (#98021, @gavinfish) [SIG Storage]
The deprecated feature gates RotateKubeletClientCertificate, AttachVolumeLimit, VolumePVCDataSource and EvenPodsSpread are now unconditionally enabled and can no longer be specified in component invocations. (#97306, @gavinfish) [SIG Node, Scheduling and Storage]
The e2e suite can be instructed not to wait for pods in kube-system to be ready or for all nodes to be ready by passing --allowed-not-ready-nodes=-1 when invoking the e2e.test program. This allows callers to run subsets of the e2e suite in scenarios other than perfectly healthy clusters. (#98781, @smarterclayton) [SIG Testing]
The feature gates WindowsGMSA and WindowsRunAsUserName that are GA since v1.18 are now removed. (#96531, @ialidzhikov) [SIG Node and Windows]
The new -gce-zones flag on the e2e.test binary instructs tests that check for information about how the cluster interacts with the cloud to limit their queries to the provided zone list. If not specified, the current behavior of asking the cloud provider for all available zones in multi zone clusters is preserved. (#98787, @smarterclayton) [SIG API Machinery, Cluster Lifecycle and Testing]
Windows nodes on GCE will take longer to start due to dependencies installed at node creation time. (#98284, @pjh) [SIG Cloud Provider]
apiserver_storage_objects (a newer version of etcd_object_counts) is promoted and marked as stable. (#100082, @logicalhan)
Uncategorized
GCE L4 Loadbalancers now handle > 5 ports in service spec correctly. (#99595, @prameshj) [SIG Cloud Provider]
The DownwardAPIHugePages feature is beta. Users may use the feature if all workers in their cluster are at minimum version 1.20. The feature will be enabled by default in all installations in 1.22. (#99610, @derekwaynecarr) [SIG Node]
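As an illustrative sketch (pod and variable names invented), the feature allows hugepage requests and limits to be exposed through the downward API's resourceFieldRef, alongside the existing cpu and memory resources:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      limits:
        cpu: "1"
        memory: 128Mi
        hugepages-2Mi: 128Mi
    env:
    - name: HUGEPAGES_2MI_LIMIT
      valueFrom:
        resourceFieldRef:
          containerName: app
          # exposed via the DownwardAPIHugePages feature gate
          resource: limits.hugepages-2Mi
```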
Urgent Upgrade Notes (No, really, you MUST read this before you upgrade)
Migrated pkg/kubelet/cm/cpuset/cpuset.go to structured logging. Exit code changed from 255 to 1. (#100007, @utsavoza) [SIG Instrumentation and Node]
Changes by Kind
API Change
Add Probe-level terminationGracePeriodSeconds field (#99375, @ehashman) [SIG API Machinery, Apps, Node and Testing]
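A hedged sketch of the new field (available behind the ProbeTerminationGracePeriod feature gate; pod and port values are illustrative), which lets a failing probe trigger a shorter kill than the pod-level grace period:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  terminationGracePeriodSeconds: 3600   # pod-level default
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      periodSeconds: 10
      failureThreshold: 3
      # overrides the pod-level value when this probe fails
      terminationGracePeriodSeconds: 60
```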
CSIServiceAccountToken is Beta now (#99298, @zshihang) [SIG Auth, Storage and Testing]
Discovery.k8s.io/v1beta1 EndpointSlices are deprecated in favor of discovery.k8s.io/v1, and will no longer be served in Kubernetes v1.25. (#100472, @liggitt) [SIG Network]
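As a migration sketch (names and addresses invented), an EndpointSlice written against the GA API only needs its apiVersion updated in the common case:

```yaml
apiVersion: discovery.k8s.io/v1   # was discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: example-abc
  labels:
    kubernetes.io/service-name: example
addressType: IPv4
ports:
- name: http
  port: 80
  protocol: TCP
endpoints:
- addresses:
  - "10.0.0.1"
```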
FieldManager no longer owns fields that get reset before the object is persisted (e.g. "status wiping"). (#99661, @kevindelgado) [SIG API Machinery, Auth and Testing]
Generic ephemeral volumes are beta. (#99643, @pohly) [SIG API Machinery, Apps, Auth, CLI, Node, Storage and Testing]
Implement the GetAvailableResources in the podresources API. (#95734, @fromanirh) [SIG Instrumentation, Node and Testing]
The Endpoints controller will now set the endpoints.kubernetes.io/over-capacity annotation to "warning" when an Endpoints resource contains more than 1000 addresses. In a future release, the controller will truncate Endpoints that exceed this limit. The EndpointSlice API can be used to support significantly larger number of addresses. (#99975, @robscott) [SIG Apps and Network]
The PodDisruptionBudget API has been promoted to policy/v1 with no schema changes. The only functional change is that an empty selector ({}) written to a policy/v1 PodDisruptionBudget now selects all pods in the namespace. The behavior of the policy/v1beta1 API remains unchanged. The policy/v1beta1 PodDisruptionBudget API is deprecated and will no longer be served in 1.25+. (#99290, @mortent) [SIG API Machinery, Apps, Auth, Autoscaling, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation, Scheduling and Testing]
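An illustrative sketch of the behavioral difference (resource name invented): under policy/v1, writing an empty selector now selects every pod in the namespace, whereas the same manifest under policy/v1beta1 selected none.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: all-pods-pdb
spec:
  minAvailable: 1
  # in policy/v1 an empty selector matches ALL pods in the namespace;
  # in policy/v1beta1 it matched no pods
  selector: {}
```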
Topology Aware Hints are now available in alpha and can be enabled with the TopologyAwareHints feature gate. (#99522, @robscott) [SIG API Machinery, Apps, Auth, Instrumentation, Network and Testing]
Feature
Add e2e test to validate performance metrics of volume lifecycle operations (#94334, @RaunakShah) [SIG Storage and Testing]
EmptyDir memory backed volumes are sized as the minimum of pod allocatable memory on a host and an optional explicit user provided value. (#100319, @derekwaynecarr) [SIG Node]
Enables Kubelet to check volume condition and log events to corresponding pods. (#99284, @fengzixu) [SIG Apps, Instrumentation, Node and Storage]
Introduce a churn operator to scheduler perf testing framework. (#98900, @Huang-Wei) [SIG Scheduling and Testing]
Kubernetes is now built with Golang 1.16.1 (#100106, @justaugustus) [SIG Cloud Provider, Instrumentation, Release and Testing]
Migrated pkg/kubelet/cm/devicemanager to structured logging (#99976, @knabben) [SIG Instrumentation and Node]
Migrated pkg/kubelet/cm/memorymanager to structured logging (#99974, @knabben) [SIG Instrumentation and Node]
Migrated pkg/kubelet/cm/topologymanager to structured logging (#99969, @knabben) [SIG Instrumentation and Node]
Rename metrics etcd_object_counts to apiserver_storage_object_counts and mark it as stable. The original etcd_object_counts metrics name is marked as "Deprecated" and will be removed in the future. (#99785, @erain) [SIG API Machinery, Instrumentation and Testing]
Update pause container to run as pseudo user and group 65535:65535. This implies the release of version 3.5 of the container images. (#97963, @saschagrunert) [SIG CLI, Cloud Provider, Cluster Lifecycle, Node, Release, Security and Testing]
Users can specify the kubectl.kubernetes.io/default-exec-container annotation in a Pod to preselect a container for kubectl commands. (#99833, @mengjiao-liu) [SIG CLI]
Bug or Regression
Add ability to skip OpenAPI handler installation to the GenericAPIServer (#100341, @kevindelgado) [SIG API Machinery]
Count pod overhead against an entity's ResourceQuota (#99600, @gjkim42) [SIG API Machinery and Node]
EndpointSlice controllers are less likely to create duplicate EndpointSlices. (#100103, @robscott) [SIG Apps and Network]
Ensure only one LoadBalancer rule is created when HA mode is enabled (#99825, @feiskyer) [SIG Cloud Provider]
Fixed a race condition on API server startup ensuring previously created webhook configurations are effective before the first write request is admitted. (#95783, @roycaihw) [SIG API Machinery]
Fixed authentication_duration_seconds metric. Previously it included whole apiserver request duration. (#99944, @marseel) [SIG API Machinery, Instrumentation and Scalability]
Fixes issue where inline AzureFile secrets could not be accessed from the pod's namespace. (#100563, @msau42) [SIG Storage]
Improve speed of vSphere PV provisioning and reduce number of API calls (#100054, @gnufied) [SIG Cloud Provider and Storage]
Kubectl: Fixed panic when describing an ingress backend without an API Group (#100505, @lauchokyip) [SIG CLI]
Kubectl: fix case of the age column in describe node (#96963, @bl-ue) [SIG CLI]
Kubelet.exe on Windows now checks that the process is running as administrator and that the executing user account is listed in the built-in administrators group. This is equivalent to checking that the process is running as uid 0. (#96616, @perithompson) [SIG Node and Windows]
Kubelet: Fixed a bug in getting the CPU count on Windows when the number of logical processors is more than 64 (#97378, @hwdef) [SIG Node and Windows]
Pass KUBE_BUILD_CONFORMANCE=y to the package-tarballs to reenable building the conformance tarballs. (#100571, @puerco) [SIG Release]
Pod log stats for Windows now report metrics (#99221, @jsturtevant) [SIG Node, Storage, Testing and Windows]
Other (Cleanup or Flake)
A new storage E2E testsuite covers CSIStorageCapacity publishing if a driver opts into the test. (#100537, @pohly) [SIG Storage and Testing]
Convert cmd/kubelet/app/server.go to structured logging (#98334, @wawa0210) [SIG Node]
If kube-apiserver has the goaway feature enabled, clients are required to use golang 1.15.8 or 1.16+ to avoid an unexpected data race issue. (#98809, @answer1991) [SIG API Machinery]
Increased CSINodeIDMaxLength from 128 bytes to 192 bytes. (#98753, @Jiawei0227) [SIG Apps and Storage]
Migrate pkg/kubelet/pluginmanager to structured logging (#99885, @qingwave) [SIG Node]
Migrate pkg/kubelet/preemption/preemption.go and pkg/kubelet/logs/container_log_manager.go to structured logging (#99848, @qingwave) [SIG Node]
Migrate pkg/kubelet/(volume,container) to structured logging (#98850, @yangjunmyfm192085) [SIG Node]
Migrate pkg/kubelet/kubelet_node_status.go to structured logging (#98154, @yangjunmyfm192085) [SIG Node and Release]
Migrate pkg/kubelet/lifecycle,oom to structured logging (#99479, @mengjiao-liu) [SIG Instrumentation and Node]
Migrate cmd/kubelet/+ pkg/kubelet/cadvisor/cadvisor_linux.go + pkg/kubelet/cri/remote/util/util_unix.go + pkg/kubelet/images/image_manager.go to structured logging (#99994, @AfrouzMashayekhi) [SIG Instrumentation and Node]
Migrate pkg/kubelet/cm/container_manager_linux.go and pkg/kubelet/cm/container_manager_stub.go to structured logging (#100001, @shiyajuan123) [SIG Instrumentation and Node]
Migrate pkg/kubelet/cm/cpumanager/{topology/topology.go, policy_none.go, cpu_assignment.go} to structured logging (#100163, @lala123912) [SIG Instrumentation and Node]
Migrate pkg/kubelet/cm/cpumanager/state to structured logging (#99563, @jmguzik) [SIG Instrumentation and Node]
Migrate pkg/kubelet/config to structured logging (#100002, @AfrouzMashayekhi) [SIG Instrumentation and Node]
Migrate pkg/kubelet/kubelet.go to structured logging (#99861, @navidshaikh) [SIG Instrumentation and Node]
Migrate pkg/kubelet/kubeletconfig to structured logging (#100265, @ehashman) [SIG Node]
Migrate pkg/kubelet/kuberuntime to structured logging (#99970, @krzysiekg) [SIG Instrumentation and Node]
Migrate pkg/kubelet/prober to structured logging (#99830, @krzysiekg) [SIG Instrumentation and Node]
Migrate pkg/kubelet/winstats to structured logging (#99855, @hexxdump) [SIG Instrumentation and Node]
Migrate probe log messages to structured logging (#97093, @aldudko) [SIG Instrumentation and Node]
Migrate remaining kubelet files to structured logging (#100196, @ehashman) [SIG Instrumentation and Node]
apiserver_storage_objects (a newer version of etcd_object_counts) is promoted and marked as stable. (#100082, @logicalhan) [SIG API Machinery, Instrumentation and Testing]
Urgent Upgrade Notes (No, really, you MUST read this before you upgrade)
Kubeadm: during "init" an empty cgroupDriver value in the KubeletConfiguration is now always set to "systemd" unless the user is explicit about it. This requires existing machine setups to configure the container runtime to use the "systemd" driver. Documentation on this topic can be found here: https://kubernetes.io/docs/setup/production-environment/container-runtimes/. When upgrading existing clusters / nodes using "kubeadm upgrade" the old cgroupDriver value is preserved, but in 1.22 this change will also apply to "upgrade". For more information on migrating to the "systemd" driver or remaining on the "cgroupfs" driver see: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/. (#99471, @neolit123) [SIG Cluster Lifecycle]
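A hedged sketch of how to stay on the "cgroupfs" driver by being explicit in the kubeadm configuration (the KubeletConfiguration document is passed to "kubeadm init" alongside the other config documents; the file name is illustrative):

```yaml
# kubeadm-config.yaml (sketch)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# omit this field and "kubeadm init" now defaults it to "systemd";
# set it explicitly to keep the previous behavior
cgroupDriver: cgroupfs
```

Whichever driver is chosen, the container runtime must be configured to use the same one.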
Migrate pkg/kubelet/(dockershim, network) to structured logging
Exit code changed from 255 to 1 (#98939, @yangjunmyfm192085) [SIG Network and Node]
Migrate pkg/kubelet/certificate to structured logging
Exit code changed from 255 to 1 (#98993, @SataQiu) [SIG Auth and Node]
Newly provisioned PVs by the EBS plugin will no longer use the deprecated "failure-domain.beta.kubernetes.io/zone" and "failure-domain.beta.kubernetes.io/region" labels. They will use the "topology.kubernetes.io/zone" and "topology.kubernetes.io/region" labels instead. (#99130, @ayberk) [SIG Cloud Provider, Storage and Testing]
Newly provisioned PVs by the OpenStack Cinder plugin will no longer use the deprecated "failure-domain.beta.kubernetes.io/zone" and "failure-domain.beta.kubernetes.io/region" labels. They will use the "topology.kubernetes.io/zone" and "topology.kubernetes.io/region" labels instead. (#99719, @jsafrane) [SIG Cloud Provider and Storage]
OpenStack Cinder CSI migration is on by default; the Cinder CSI driver must be installed on OpenStack clusters for Cinder volumes to work. (#98538, @dims) [SIG Storage]
Package pkg/kubelet/server migrated to structured logging
Exit code changed from 255 to 1 (#99838, @adisky) [SIG Node]
Pkg/kubelet/kuberuntime/kuberuntime_manager.go migrated to structured logging
Exit code changed from 255 to 1 (#99841, @adisky) [SIG Instrumentation and Node]
Changes by Kind
Deprecation
Kubeadm: the deprecated kube-dns is no longer supported as an option. If "ClusterConfiguration.dns.type" is set to "kube-dns" kubeadm will now throw an error. (#99646, @rajansandeep) [SIG Cluster Lifecycle]
Remove deprecated --generator --replicas --service-generator --service-overrides --schedule from kubectl run
Deprecate --serviceaccount --hostport --requests --limits in kubectl run (#99732, @soltysh) [SIG CLI and Testing]
audit.k8s.io/v1beta1 and audit.k8s.io/v1alpha1 audit policy configuration and audit events are deprecated in favor of audit.k8s.io/v1, available since v1.13. kube-apiserver invocations that specify alpha or beta policy configurations with --audit-policy-file, or explicitly request alpha or beta audit events with --audit-log-version / --audit-webhook-version must update to use audit.k8s.io/v1 and accept audit.k8s.io/v1 events prior to v1.24. (#98858, @carlory) [SIG Auth]
The diskformat storage class parameter for the in-tree vSphere volume plugin is deprecated as of the v1.21 release. Please consider updating the StorageClass and removing the diskformat parameter. The vSphere CSI Driver does not support the diskformat StorageClass parameter.
vSphere releases earlier than 67u3 are deprecated as of v1.21. Please consider upgrading vSphere to 67u3 or above. The vSphere CSI Driver requires a minimum of vSphere 67u3.
VM hardware versions lower than 15 are deprecated as of v1.21. Please consider upgrading the Node VM hardware version to 15 or above. The vSphere CSI Driver recommends a Node VM hardware version of at least vmx-15.
Multi-vCenter support is deprecated as of v1.21. If you have a Kubernetes cluster spanning multiple vCenter servers, please consider moving all Kubernetes nodes to a single vCenter server. The vSphere CSI Driver does not support Kubernetes deployments spanning multiple vCenter servers.
Support for these deprecations will be available until Kubernetes v1.24. (#98546, @divyenpatel) [SIG Cloud Provider and Storage]
API Change
PodAffinityTerm includes a namespaceSelector field to allow selecting eligible namespaces based on their labels.
A new CrossNamespacePodAffinity quota scope API allows restricting which namespaces are allowed to use PodAffinityTerm with cross-namespace references via the namespaceSelector or namespaces fields. (#98582, @ahg-g) [SIG API Machinery, Apps, Auth and Testing]
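A minimal sketch of the new namespaceSelector field on a pod affinity term (the pod/label names here are hypothetical; the field is alpha behind the PodAffinityNamespaceSelector feature gate):

```yaml
# Hypothetical Pod: co-schedule with pods labeled app=cache that live in any
# namespace labeled team=storefront, selected via namespaceSelector.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: cache
        namespaceSelector:        # new field: select namespaces by label
          matchLabels:
            team: storefront
  containers:
  - name: web
    image: nginx
```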
Add a default metadata name label for selecting any namespace by its name. (#96968, @jayunit100) [SIG API Machinery, Apps, Cloud Provider, Storage and Testing]
Added .spec.completionMode field to Job, with accepted values NonIndexed (default) and Indexed (#98441, @alculquicondor) [SIG Apps and CLI]
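A sketch of the new field on a hypothetical Job (names are illustrative; each pod of an Indexed Job receives its index in the batch.kubernetes.io/job-completion-index annotation):

```yaml
# Hypothetical Indexed Job: pods get completion indexes 0..4 and the Job is
# complete once one pod per index has succeeded.
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo
spec:
  completions: 5
  parallelism: 2
  completionMode: Indexed   # new field; default is NonIndexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing one shard"]
```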
DaemonSets accept a MaxSurge integer or percent on their rolling update strategy that will launch the updated pod on nodes and wait for those pods to go ready before marking the old out-of-date pods as deleted. This allows workloads to avoid downtime during upgrades when deployed using DaemonSets. This feature is alpha and is behind the DaemonSetUpdateSurge feature gate. (#96441, @smarterclayton) [SIG Apps and Testing]
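A sketch of the surge-based update strategy on a DaemonSet fragment (requires the DaemonSetUpdateSurge feature gate, per the entry above; values are illustrative):

```yaml
# Hypothetical DaemonSet fragment: bring up the replacement pod on each node
# and wait for it to be Ready before deleting the old pod.
apiVersion: apps/v1
kind: DaemonSet
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # new: allow one extra updated pod per node
      maxUnavailable: 0    # keep the old pod serving until the new one is Ready
```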
EndpointSlice API is now GA. The EndpointSlice topology field has been removed from the GA API and will be replaced by a new per Endpoint Zone field. If the topology field was previously used, it will be converted into an annotation in the v1 Resource. The discovery.k8s.io/v1alpha1 API is removed. (#99662, @swetharepakula) [SIG API Machinery, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation, Network and Testing]
EndpointSlice Controllers are now GA. The EndpointSlice Controller will not populate the deprecatedTopology field and will only provide topology information through the zone and nodeName fields. (#99870, @swetharepakula) [SIG API Machinery, Apps, Auth, Network and Testing]
IngressClass resource can now reference a resource in a specific namespace
for implementation-specific configuration (previously only cluster-level resources were allowed).
This feature can be enabled using the IngressClassNamespacedParams feature gate. (#99275, @hbagdi) [SIG API Machinery, CLI and Network]
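A sketch of an IngressClass referencing a namespaced parameters object (controller and resource names here are hypothetical; requires the IngressClassNamespacedParams feature gate):

```yaml
# Hypothetical IngressClass whose controller-specific configuration lives in
# a specific namespace instead of a cluster-scoped resource.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: external-lb
spec:
  controller: example.com/ingress-controller
  parameters:
    apiGroup: example.com
    kind: IngressParameters
    name: external-lb-config
    scope: Namespace          # new: previously only cluster-scoped references
    namespace: ingress-config # new: namespace holding the parameters object
```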
Introduce conditions for PodDisruptionBudget (#98127, @mortent) [SIG API Machinery, Apps, Auth, CLI, Cloud Provider, Cluster Lifecycle and Instrumentation]
Jobs API has a new .spec.suspend field that can be used to suspend and resume Jobs (#98727, @adtac) [SIG API Machinery, Apps, Node, Scheduling and Testing]
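A sketch of a Job created in the suspended state (alpha behind the SuspendJob feature gate; names are illustrative):

```yaml
# Hypothetical Job that starts suspended; patching .spec.suspend to false
# later lets the Job controller create its pods.
apiVersion: batch/v1
kind: Job
metadata:
  name: deferred-job
spec:
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo run later"]
```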
Kubelet Graceful Node Shutdown feature is now beta. (#99735, @bobbypage) [SIG Node]
Limit the request value of hugepages to an integer multiple of the page size. (#98515, @lala123912) [SIG Apps]
One new field "InternalTrafficPolicy" in Service is added.
It specifies if the cluster internal traffic should be routed to all endpoints or node-local endpoints only.
"Cluster" routes internal traffic to a Service to all endpoints.
"Local" routes traffic to node-local endpoints only, and traffic is dropped if no node-local endpoints are ready.
The default value is "Cluster". (#96600, @maplain) [SIG API Machinery, Apps and Network]
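A sketch of the new field on a hypothetical Service (alpha behind the ServiceInternalTrafficPolicy feature gate; names are illustrative):

```yaml
# Hypothetical Service keeping in-cluster traffic on the originating node;
# traffic is dropped if no node-local endpoint is ready.
apiVersion: v1
kind: Service
metadata:
  name: dns-cache
spec:
  selector:
    app: dns-cache
  ports:
  - port: 53
    protocol: UDP
  internalTrafficPolicy: Local   # default is Cluster
```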
PodSecurityPolicy only stores "generic" as allowed volume type if the GenericEphemeralVolume feature gate is enabled (#98918, @pohly) [SIG Auth and Security]
Promote CronJobs to batch/v1 (#99423, @soltysh) [SIG API Machinery, Apps, CLI and Testing]
Remove support for building Kubernetes with bazel. (#99561, @BenTheElder) [SIG API Machinery, Apps, Architecture, Auth, Autoscaling, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation, Network, Node, Release, Scalability, Scheduling, Storage, Testing and Windows]
Setting loadBalancerClass on a Service of type LoadBalancer is available with this PR.
Users who want to use a custom load balancer can specify loadBalancerClass to achieve it. (#98277, @XudongLiuHarold) [SIG API Machinery, Apps, Cloud Provider and Network]
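A sketch of the field on a hypothetical Service fragment (the class name is illustrative; the default cloud-provider load balancer ignores Services that set a loadBalancerClass):

```yaml
# Hypothetical Service delegating load-balancer provisioning to a custom
# implementation identified by its class name.
apiVersion: v1
kind: Service
spec:
  type: LoadBalancer
  loadBalancerClass: example.com/internal-lb
```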
Storage capacity tracking (= the CSIStorageCapacity feature) is beta, storage.k8s.io/v1alpha1/VolumeAttachment and storage.k8s.io/v1alpha1/CSIStorageCapacity objects are deprecated (#99641, @pohly) [SIG API Machinery, Apps, Auth, Scheduling, Storage and Testing]
Support for Indexed Job: a Job that is considered completed when Pods associated to indexes from 0 to (.spec.completions-1) have succeeded. (#98812, @alculquicondor) [SIG Apps and CLI]
The apiserver now resets managedFields that got corrupted by a mutating admission controller. (#98074, @kwiesmueller) [SIG API Machinery and Testing]
controller.kubernetes.io/pod-deletion-cost annotation can be set to offer a hint on the cost of deleting a pod compared to other pods belonging to the same ReplicaSet. Pods with lower deletion cost are deleted first. This is an alpha feature. (#99163, @ahg-g) [SIG Apps]
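A sketch of the annotation on a Pod fragment (alpha behind the PodDeletionCost feature gate; the value is illustrative):

```yaml
# Hypothetical Pod fragment: during a ReplicaSet downscale, pods with a lower
# deletion cost are deleted before this one.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "100"
```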
Feature
A client-go metric, rest_client_exec_plugin_call_total, has been added to track total calls to client-go credential plugins. (#98892, @ankeesler) [SIG API Machinery, Auth, Cluster Lifecycle and Instrumentation]
Add --use-protocol-buffers flag to kubectl top pods and nodes (#96655, @serathius) [SIG CLI]
Add support to generate client-side binaries for new darwin/arm64 platform (#97743, @dims) [SIG Release and Testing]
Added ephemeral_volume_controller_create[_failures]_total counters to kube-controller-manager metrics (#99115, @pohly) [SIG API Machinery, Apps, Cluster Lifecycle, Instrumentation and Storage]
Adds alpha feature VolumeCapacityPriority which makes the scheduler prioritize nodes based on the best matching size of statically provisioned PVs across multiple topologies. (#96347, @cofyc) [SIG Apps, Network, Scheduling, Storage and Testing]
Adds two new metrics to CronJobs: a histogram tracking the difference between the time a job is created and the time it was expected to be created, and a gauge for the missed schedules of a CronJob (#99341, @alaypatel07) [SIG Apps and Instrumentation]
Alpha implementation of Kubectl Command Headers: SIG CLI KEP 859 is enabled when the KUBECTL_COMMAND_HEADERS environment variable is set on the client command line.
To enable: export KUBECTL_COMMAND_HEADERS=1; kubectl ... (#98952, @seans3) [SIG API Machinery and CLI]
Component owners can configure the allowlist of metric labels with the flag '--allow-metric-labels'. (#99738, @YoyinZyc) [SIG API Machinery, Cluster Lifecycle and Instrumentation]
Disruption controller only sends one event per PodDisruptionBudget if scale can't be computed (#98128, @mortent) [SIG Apps]
EndpointSliceNodeName will always be enabled, so NodeName will always be available in the v1beta1 API. (#99746, @swetharepakula) [SIG Apps and Network]
Graduate CRIContainerLogRotation feature gate to GA. (#99651, @umohnani8) [SIG Node and Testing]
Kube-proxy iptables: new metric sync_proxy_rules_iptables_total that exposes the number of rules programmed per table in each iteration (#99653, @aojea) [SIG Instrumentation and Network]
Kube-scheduler now logs plugin scoring summaries at --v=4 (#99411, @damemi) [SIG Scheduling]
Kubeadm: emit a warning to the user that IPv6 site-local addresses are deprecated (#99574, @pacoxu) [SIG Cluster Lifecycle and Network]
Kubeadm: apply the "node.kubernetes.io/exclude-from-external-load-balancers" label on control plane nodes during "init", "join" and "upgrade" to preserve backwards compatibility with the legacy LB mode where nodes labeled as "master" were excluded. To opt out you can remove the label from a node. See #97543 and the linked KEP for more details. (#98269, @neolit123) [SIG Cluster Lifecycle]
Kubeadm: if the user has customized their image repository via the kubeadm configuration, pass the custom pause image repository and tag to the kubelet via --pod-infra-container-image not only for Docker but for all container runtimes. This flag tells the kubelet that it should not garbage collect the image. (#99476, @neolit123) [SIG Cluster Lifecycle]
Kubectl version changed to write a warning message to stderr if the client and server version difference exceeds the supported version skew of +/-1 minor version. (#98250, @brianpursley) [SIG CLI]
Kubernetes is now built with Golang 1.16 (#98572, @justaugustus) [SIG API Machinery, Auth, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation, Node, Release and Testing]
Persistent Volumes formatted with the btrfs filesystem will now automatically resize when expanded. (#99361, @Novex) [SIG Storage]
Remove cAdvisor json metrics api collected by Kubelet (#99236, @pacoxu) [SIG Node]
Sysctls is now GA and locked to default (#99158, @wgahnagl) [SIG Node]
The NodeAffinity plugin implements the PreFilter extension, offering enhanced performance for Filter. (#99213, @AliceZhang2016) [SIG Scheduling]
The endpointslice mirroring controller mirrors endpoint annotations and labels to the generated endpoint slices; it also ensures that updates to any of these fields are mirrored.
The well-known annotation endpoints.kubernetes.io/last-change-trigger-time is skipped and not mirrored. (#98116, @aojea) [SIG Apps, Network and Testing]
Update the latest validated version of Docker to 20.10 (#98977, @neolit123) [SIG CLI, Cluster Lifecycle and Node]
Upgrade node local dns to 1.17.0 for better IPv6 support (#99749, @pacoxu) [SIG Cloud Provider and Network]
Users can specify the kubectl.kubernetes.io/default-exec-container annotation in a Pod to preselect a container for kubectl commands. (#99581, @mengjiao-liu) [SIG CLI]
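A sketch of the annotation on a Pod fragment (the container name is illustrative):

```yaml
# Hypothetical Pod fragment: kubectl exec defaults to the "app" container
# instead of the first container in spec.containers.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-exec-container: app
```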
When downscaling ReplicaSets, ready and creation timestamps are compared in a logarithmic scale. (#99212, @damemi) [SIG Apps and Testing]
When the kubelet is watching a ConfigMap or Secret purely in the context of setting environment variables
for containers, only hold that watch for a defined duration before cancelling it. This change reduces the CPU
and memory usage of the kube-apiserver in large clusters. (#99393, @chenyw1990) [SIG API Machinery, Node and Testing]
WindowsEndpointSliceProxying feature gate has graduated to beta and is enabled by default. This means kube-proxy will read from EndpointSlices instead of Endpoints on Windows by default. (#99794, @robscott) [SIG Network]
Bug or Regression
Creating a PVC with DataSource should fail for non-CSI plugins. (#97086, @xing-yang) [SIG Apps and Storage]
EndpointSlice controller is now less likely to emit FailedToUpdateEndpointSlices events. (#99345, @robscott) [SIG Apps and Network]
EndpointSliceMirroring controller is now less likely to emit FailedToUpdateEndpointSlices events. (#99756, @robscott) [SIG Apps and Network]
Fix a bug where --ignore-errors did not take effect if multiple logs were printed and not followed (#97686, @wzshiming) [SIG CLI]
Fix bug that would let the Horizontal Pod Autoscaler scale down despite at least one metric being unavailable/invalid (#99514, @mikkeloscar) [SIG Apps and Autoscaling]
Fix cgroup handling for systemd with cgroup v2 (#98365, @odinuge) [SIG Node]
Fix smb mount PermissionDenied issue on Windows (#99550, @andyzhangx) [SIG Cloud Provider, Storage and Windows]
Fixed a bug that causes smaller number of conntrack-max being used under CPU static policy. (#99225, @xh4n3) (#99613, @xh4n3) [SIG Network]
Fixed bug that caused cAdvisor to incorrectly detect single-socket multi-NUMA topology. (#99315, @iwankgb) [SIG Node]
Improved update time of pod statuses following new probe results. (#98376, @matthyx) [SIG Node and Testing]
Kube-apiserver: an update of a pod with a generic ephemeral volume dropped that volume if the feature had been disabled since creating the pod with such a volume (#99446, @pohly) [SIG Apps, Node and Storage]
Kubeadm: skip validating pod subnet against node-cidr-mask when allocate-node-cidrs is set to be false (#98984, @SataQiu) [SIG Cluster Lifecycle]
On single-stack configured (IPv4 or IPv6, but not both) clusters, Services which are both headless (no clusterIP) and selectorless (empty or undefined selector) will report ipFamilyPolicy RequireDualStack and will have entries in ipFamilies[] for both IPv4 and IPv6. This is a change from alpha, but does not have any impact on the manually-specified Endpoints and EndpointSlices for the Service. (#99555, @thockin) [SIG Apps and Network]
Resolves spurious Failed to list *v1.Secret or Failed to list *v1.ConfigMap messages in kubelet logs. (#99538, @liggitt) [SIG Auth and Node]
Return zero time (midnight on Jan. 1, 1970) instead of negative number when reporting startedAt and finishedAt of the not started or a running Pod when using dockershim as a runtime. (#99585, @Iceber) [SIG Node]
Stdin is now only passed to client-go exec credential plugins when it is detected to be an interactive terminal. Previously, it was passed to client-go exec plugins when stdout was detected to be an interactive terminal. (#99654, @ankeesler) [SIG API Machinery and Auth]
The maximum number of ports allowed in EndpointSlices has been increased from 100 to 20,000 (#99795, @robscott) [SIG Network]
Updates the commands kubectl kustomize {arg} and kubectl apply -k {arg} to use the same code as the kustomize CLI v4.0.5
When a CNI plugin returns dual-stack pod IPs, kubelet will now try to respect the
"primary IP family" of the cluster by picking a primary pod IP of the same family
as the (primary) node IP, rather than assuming that the CNI plugin returned the IPs
in the order the administrator wanted (since some CNI plugins don't allow
configuring this). (#97979, @danwinship) [SIG Network and Node]
When using Containerd on Windows, the "C:\Windows\System32\drivers\etc\hosts" file will now be managed by kubelet. (#83730, @claudiubelu) [SIG Node and Windows]
VolumeBindingArgs now allows BindTimeoutSeconds to be set to zero, where zero indicates no waiting for the volume binding check. (#99835, @chendave) [SIG Scheduling and Storage]
kubectl exec and kubectl attach now honor the --quiet flag which suppresses output from the local binary that could be confused by a script with the remote command output (all non-failure output is hidden). In addition, print inline with exec and attach the list of alternate containers when we default to the first spec.container. (#99004, @smarterclayton) [SIG CLI]
Other (Cleanup or Flake)
Apiserver_request_duration_seconds is promoted to stable status. (#99925, @logicalhan) [SIG API Machinery, Instrumentation and Testing]
Apiserver_request_total is promoted to stable status and no longer has a content-type dimension, so any alerts/charts that presume its existence will fail. This is, however, unlikely to be the case since it was effectively an unbounded dimension in the first place. (#99788, @logicalhan) [SIG API Machinery, Instrumentation and Testing]
EndpointSlice generation is now incremented when labels change. (#99750, @robscott) [SIG Network]
Featuregate AllowInsecureBackendProxy is promoted to GA (#99658, @deads2k) [SIG API Machinery]
Migrate pkg/kubelet/cloudresource to structured logging (#98999, @sladyn98) [SIG Node]
Migrate pkg/kubelet/cri/remote logs to structured logging (#98589, @chenyw1990) [SIG Node]
Migrate pkg/kubelet/kuberuntime/kuberuntime_container.go logs to structured logging (#96973, @chenyw1990) [SIG Instrumentation and Node]
Migrate pkg/kubelet/status to structured logging (#99836, @navidshaikh) [SIG Instrumentation and Node]
Migrate pkg/kubelet/token to structured logging (#99264, @palnabarun) [SIG Auth, Instrumentation and Node]
Migrate pkg/kubelet/util to structured logging (#99823, @navidshaikh) [SIG Instrumentation and Node]
Migrate proxy/userspace/proxier.go logs to structured logging (#97837, @JornShen) [SIG Network]
Migrate some kubelet/metrics log messages to structured logging (#98627, @jialaijun) [SIG Instrumentation and Node]
Process start time on Windows now uses current process information (#97491, @jsturtevant) [SIG API Machinery, CLI, Cloud Provider, Cluster Lifecycle, Instrumentation and Windows]
Uncategorized
Migrate pkg/kubelet/stats to structured logging (#99607, @krzysiekg) [SIG Node]
The DownwardAPIHugePages feature is beta. Users may use the feature if all workers in their cluster are at version 1.20 or newer. The feature will be enabled by default in all installations in 1.22. (#99610, @derekwaynecarr) [SIG Node]
Urgent Upgrade Notes (No, really, you MUST read this before you upgrade)
The metric storage_operation_errors_total is not removed, but is marked deprecated, and the metric storage_operation_status_count is marked deprecated. In both cases the storage_operation_duration_seconds metric can be used to recover equivalent counts (using status=fail-unknown in the case of storage_operation_errors_total). (#99045, @mattcary) [SIG Instrumentation and Storage]
Changes by Kind
Deprecation
The batch/v2alpha1 CronJob type definitions and clients are deprecated and removed. (#96987, @soltysh) [SIG API Machinery, Apps, CLI and Testing]
API Change
Cluster admins can now turn off the /debug/pprof and /debug/flags/v endpoints in the kubelet by setting enableProfilingHandler and enableDebugFlagsHandler to false in their kubelet configuration file. enableProfilingHandler and enableDebugFlagsHandler can be set to true only when enableDebuggingHandlers is also set to true. (#98458, @SaranBalaji90) [SIG Node]
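A sketch of the new knobs in a KubeletConfiguration fragment:

```yaml
# Hypothetical KubeletConfiguration fragment disabling the two debug
# endpoints while keeping the other debugging handlers enabled.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enableDebuggingHandlers: true   # must be true for the two settings below to apply
enableProfilingHandler: false   # turns off /debug/pprof
enableDebugFlagsHandler: false  # turns off /debug/flags/v
```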
The BoundServiceAccountTokenVolume feature has been promoted to beta, and enabled by default.
This changes the tokens provided to containers at /var/run/secrets/kubernetes.io/serviceaccount/token to be time-limited, auto-refreshed, and invalidated when the containing pod is deleted.
Clients should reload the token from disk periodically (once per minute is recommended) to ensure they continue to use a valid token. k8s.io/client-go versions v11.0.0+ and v0.15.0+ reload tokens automatically.
By default, injected tokens are given an extended lifetime so they remain valid even after a new refreshed token is provided. The metric serviceaccount_stale_tokens_total can be used to monitor for workloads that are depending on the extended lifetime and are continuing to use tokens even after a refreshed token is provided to the container. If that metric indicates no existing workloads are depending on extended lifetimes, injected token lifetime can be shortened to 1 hour by starting kube-apiserver with --service-account-extend-token-expiration=false. (#95667, @zshihang) [SIG API Machinery, Auth, Cluster Lifecycle and Testing]
Feature
A new histogram metric to track the time it took to delete a job by the ttl-after-finished controller (#98676, @ahg-g) [SIG Apps and Instrumentation]
AWS cloudprovider supports auto-discovering subnets without any kubernetes.io/cluster/ tags. It also supports additional service annotation service.beta.kubernetes.io/aws-load-balancer-subnets to manually configure the subnets. (#97431, @kishorj) [SIG Cloud Provider]
Add --permit-address-sharing flag to kube-apiserver to listen with SO_REUSEADDR. While allowing listening on wildcard IPs like 0.0.0.0 and on specific IPs in parallel, it avoids waiting for the kernel to release sockets in TIME_WAIT state, considerably reducing kube-apiserver restart times under certain conditions. (#93861, @sttts) [SIG API Machinery]
Add csi_operations_seconds metric on kubelet that exposes CSI operations duration and status for node CSI operations. (#98979, @Jiawei0227) [SIG Instrumentation and Storage]
Add migrated field into storage_operation_duration_seconds metric (#99050, @Jiawei0227) [SIG Apps, Instrumentation and Storage]
Add bash-completion for comma separated list on kubectl get (#98301, @phil9909) [SIG CLI]
Added support for installing arm64 node artifacts. (#99242, @liu-cong) [SIG Cloud Provider]
Feature gate RootCAConfigMap is graduated to GA in 1.21 and will be removed in 1.22. (#98033, @zshihang) [SIG API Machinery and Auth]
Kubeadm: during "init" and "join" perform preflight validation on the host / node name and throw warnings if a name is not compliant (#99194, @pacoxu) [SIG Cluster Lifecycle]
Kubectl: kubectl get will now omit managed fields by default. Users can set --show-managed-fields to true to show managedFields when the output format is either json or yaml. (#96878, @knight42) [SIG CLI and Testing]
Metrics can now be disabled explicitly via a command line flag (i.e. '--disabled-metrics=bad_metric1,bad_metric2') (#99217, @logicalhan) [SIG API Machinery, Cluster Lifecycle and Instrumentation]
TTLAfterFinished is now beta and enabled by default (#98678, @ahg-g) [SIG Apps and Auth]
The RunAsGroup feature has been promoted to GA in this release. (#94641, @krmayankk) [SIG Auth and Node]
Turn CronJobControllerV2 on by default. (#98878, @soltysh) [SIG Apps]
UDP protocol support for Agnhost connect subcommand (#98639, @knabben) [SIG Testing]
Fix ALPHA stability level reference link (#98641, @Jeffwan) [SIG Auth, Cloud Provider, Instrumentation and Storage]
Failing Test
Escape special characters like [ and ] that exist in vSphere Windows paths (#98830, @liyanhui1228) [SIG Storage and Windows]
Kube-proxy: fix a bug on UDP NodePort Services where stale conntrack entries may blackhole the traffic directed to the NodePort. (#98305, @aojea) [SIG Network]
Bug or Regression
Add missing --kube-api-content-type in kubemark hollow template (#98911, @Jeffwan) [SIG Scalability and Testing]
Avoid duplicate error messages when running kubectl edit quota (#98201, @pacoxu) [SIG API Machinery and Apps]
Cleanup subnet in frontend IP configs to prevent huge subnet request bodies in some scenarios. (#98133, @nilo19) [SIG Cloud Provider]
Fix errors when accessing Windows container stats for Dockershim (#98510, @jsturtevant) [SIG Node and Windows]
Fixes spurious errors about IPv6 in kube-proxy logs on nodes with IPv6 disabled. (#99127, @danwinship) [SIG Network and Node]
Fixed a bug in identifying the containerd process in the method that ensures docker and containerd are in the correct containers with the proper OOM score set up. (#97888, @pacoxu) [SIG Node]
Kubelet now cleans up orphaned volume directories automatically (#95301, @lorenz) [SIG Node and Storage]
When dynamically provisioning Azure File volumes for a premium account, the requested size will be set to 100GB if the request is initially lower than this value to accommodate Azure File requirements. (#99122, @huffmanca) [SIG Cloud Provider and Storage]
Other (Cleanup or Flake)
APIs for kubelet annotations and labels from k8s.io/kubernetes/pkg/kubelet/apis are now available under k8s.io/kubelet/pkg/apis/ (#98931, @michaelbeaumont) [SIG Apps, Auth and Node]
Migrate pkg/kubelet/(pod, pleg) to structured logging (#98990, @gjkim42) [SIG Instrumentation and Node]
Migrate pkg/kubelet/nodestatus to structured logging (#99001, @QiWang19) [SIG Node]
Migrate pkg/kubelet/server logs to structured logging (#98643, @chenyw1990) [SIG Node]
Migrate proxy/winkernel/proxier.go logs to structured logging (#98001, @JornShen) [SIG Network and Windows]
Migrate scheduling_queue.go to structured logging (#98358, @tanjing2020) [SIG Scheduling]
Several flags related to the deprecated dockershim which are present in the kubelet command line are now deprecated. (#98730, @dims) [SIG Node]
The deprecated feature gates CSIDriverRegistry, BlockVolume and CSIBlockVolume are now unconditionally enabled and can no longer be specified in component invocations. (#98021, @gavinfish) [SIG Storage]
Urgent Upgrade Notes (No, really, you MUST read this before you upgrade)
Newly provisioned PVs by gce-pd will no longer have the beta FailureDomain label. gce-pd volume plugin will start to have GA topology label instead. (#98700, @Jiawei0227) [SIG Cloud Provider, Storage and Testing]
Remove alpha CSIMigrationXXComplete flag and add alpha InTreePluginXXUnregister flag. Deprecate CSIMigrationvSphereComplete flag and it will be removed in 1.22. (#98243, @Jiawei0227) [SIG Node and Storage]
Changes by Kind
API Change
Adds support for portRange / EndPort in Network Policy (#97058, @rikatz) [SIG Apps and Network]
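A sketch of the new endPort field in a NetworkPolicy fragment (alpha behind the NetworkPolicyEndPort feature gate; the port range is illustrative):

```yaml
# Hypothetical NetworkPolicy fragment allowing egress to a whole port range
# via the new endPort field instead of one rule per port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - ports:
    - protocol: TCP
      port: 32000
      endPort: 32768   # allows the 32000-32768 range
```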
Fixes using server-side apply with APIService resources (#98576, @kevindelgado) [SIG API Machinery, Apps and Testing]
Kubernetes is now built using go1.15.7 (#98363, @cpanato) [SIG Cloud Provider, Instrumentation, Node, Release and Testing]
Scheduler extender filter interface now can report unresolvable failed nodes in the new field FailedAndUnresolvableNodes of ExtenderFilterResult struct. Nodes in this map will be skipped in the preemption phase. (#92866, @cofyc) [SIG Scheduling]
Feature
A lease can only attach up to 10k objects. (#98257, @lingsamuel) [SIG API Machinery]
Add an ignore-errors flag for kubectl drain so that draining a group of nodes is not interrupted by errors (#98203, @yuzhiquan) [SIG CLI]
Base-images: Update to debian-iptables:buster-v1.4.0
Export NewDebuggingRoundTripper function and DebugLevel options in the k8s.io/client-go/transport package. (#98324, @atosatto) [SIG API Machinery]
Kubectl wait ensures that observedGeneration >= generation if applicable (#97408, @KnicKnic) [SIG CLI]
Kubernetes is now built using go1.15.8 (#98834, @cpanato) [SIG Cloud Provider, Instrumentation, Release and Testing]
New admission controller "denyserviceexternalips" is available. Clusters which do not need the Service "externalIPs" feature should enable this controller and be more secure. (#97395, @thockin) [SIG API Machinery]
Overall, enabling the PreferNominatedNode feature will improve scheduling performance where preemption might frequently happen, but in theory the pod might not be scheduled to the best candidate node in the cluster. (#93179, @chendave) [SIG Scheduling and Testing]
Pause image upgraded to 3.4.1 in kubelet and kubeadm for both Linux and Windows. (#98205, @pacoxu) [SIG CLI, Cloud Provider, Cluster Lifecycle, Node, Testing and Windows]
The ServiceAccountIssuerDiscovery feature has graduated to GA, and is unconditionally enabled. The ServiceAccountIssuerDiscovery feature-gate will be removed in 1.22. (#98553, @mtaufen) [SIG API Machinery, Auth and Testing]
Documentation
Azure file migration goes beta in 1.21. Promotes the CSIMigration feature gate to Beta (on by default) and CSIMigrationAzureFile to Beta (off by default, since it requires installation of the AzureFile CSI Driver).
The in-tree AzureFile plugin "kubernetes.io/azure-file" is now deprecated and will be removed in 1.23. Users should enable CSIMigration + CSIMigrationAzureFile features and install the AzureFile CSI Driver (https://github.com/kubernetes-sigs/azurefile-csi-driver) to avoid disruption to existing Pod and PVC objects at that time.
Users should start using the AzureFile CSI Driver directly for any new volumes. (#96293, @andyzhangx) [SIG Cloud Provider]
Failing Test
Kubelet: the HostPort implementation in dockershim did not take the HostIP field into consideration, so the same HostPort could not be used with different IP addresses.
This bug caused the conformance test "HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol" to fail. (#98755, @aojea) [SIG Cloud Provider, Network and Node]
Bug or Regression
Fix NPE in ephemeral storage eviction (#98261, @wzshiming) [SIG Node]
Fixed a bug where, on nodes whose filter table INPUT chain policy is not ACCEPT, the healthcheck nodeport would not work.
Added iptables rules to allow healthcheck nodeport traffic. (#97824, @hanlins) [SIG Network]
Fixed kube-proxy container image architecture for non amd64 images. (#98526, @saschagrunert) [SIG API Machinery, Release and Testing]
Fixed provisioning of Cinder volumes migrated to CSI when StorageClass with AllowedTopologies was used. (#98311, @jsafrane) [SIG Storage]
Fixes a panic in the disruption budget controller for PDB objects with invalid selectors (#98750, @mortent) [SIG Apps]
Fixes connection errors when using --volume-host-cidr-denylist or --volume-host-allow-local-loopback (#98436, @liggitt) [SIG Network and Storage]
If the user specifies an invalid timeout in the request URL, the request will be aborted with an HTTP 400.
In cases where the client specifies a timeout in the request URL, the overall request deadline is now shortened, since the deadline is set up as soon as the request is received by the apiserver. (#96901, @tkashem) [SIG API Machinery and Testing]
Kubeadm: Some text in the kubeadm upgrade plan output has changed. If you have scripts or other automation that parses this output, please review these changes and update your scripts to account for the new output. (#98728, @stmcginnis) [SIG Cluster Lifecycle]
Kubeadm: fix a bug where external credentials in an existing admin.conf prevented the CA certificate to be written in the cluster-info ConfigMap. (#98882, @kvaps) [SIG Cluster Lifecycle]
Kubeadm: fix bad token placeholder text in "config print *-defaults --help" (#98839, @Mattias-) [SIG Cluster Lifecycle]
Kubeadm: get k8s CI version markers from k8s infra bucket (#98836, @hasheddan) [SIG Cluster Lifecycle and Release]
Mitigate CVE-2020-8555 for kube-up using GCE by preventing local loopback volume hosts. (#97934, @mattcary) [SIG Cloud Provider and Storage]
Remove CSI topology from migrated in-tree gcepd volume. (#97823, @Jiawei0227) [SIG Cloud Provider and Storage]
Sync node status during kubelet node shutdown.
Adds a pod admission handler that rejects new pods while the node is in the process of shutting down. (#98005, @wzshiming) [SIG Node]
Truncates a message if it hits the NoteLengthLimit when the scheduler records an event for the pod that indicates the pod has failed to schedule. (#98715, @carlory) [SIG Scheduling]
We will no longer automatically delete all data when a failure is detected during creation of the volume data file on a CSI volume. Now we will only remove the data file and volume path. (#96021, @huffmanca) [SIG Storage]
Other (Cleanup or Flake)
Fix the description of command line flags that can override --config (#98254, @changshuchao) [SIG Scheduling]
Migrate staging/src/k8s.io/apiserver/pkg/admission logs to structured logging (#98138, @lala123912) [SIG API Machinery]
Resolves flakes in the Ingress conformance tests due to conflicts with controllers updating the Ingress object (#98430, @liggitt) [SIG Network and Testing]
The default delegating authorization options now allow unauthenticated access to healthz, readyz, and livez. A system:masters user connecting to an authz delegator will not perform an authz check. (#98325, @deads2k) [SIG API Machinery, Auth, Cloud Provider and Scheduling]
The e2e suite can be instructed not to wait for pods in kube-system to be ready or for all nodes to be ready by passing --allowed-not-ready-nodes=-1 when invoking the e2e.test program. This allows callers to run subsets of the e2e suite in scenarios other than perfectly healthy clusters. (#98781, @smarterclayton) [SIG Testing]
The feature gates WindowsGMSA and WindowsRunAsUserName that are GA since v1.18 are now removed. (#96531, @ialidzhikov) [SIG Node and Windows]
The new -gce-zones flag on the e2e.test binary instructs tests that check for information about how the cluster interacts with the cloud to limit their queries to the provided zone list. If not specified, the current behavior of asking the cloud provider for all available zones in multi zone clusters is preserved. (#98787, @smarterclayton) [SIG API Machinery, Cluster Lifecycle and Testing]
(No, really, you MUST read this before you upgrade)
Remove the storage metric storage_operation_errors_total, since we already have storage_operation_status_count. Add a new status field to storage_operation_duration_seconds, so that storage operation latency can be observed for every status. (#98332, @JornShen) [SIG Instrumentation and Storage]
Changes by Kind
Deprecation
Remove the TokenRequest and TokenRequestProjection feature gates (#97148, @wawa0210) [SIG Node]
Removing experimental windows container hyper-v support with Docker (#97141, @wawa0210) [SIG Node and Windows]
The export query parameter (inconsistently supported by API resources and deprecated in v1.14) is fully removed. Requests setting this query parameter will now receive a 400 status response. (#98312, @deads2k) [SIG API Machinery, Auth and Testing]
API Change
Enable SPDY pings to keep connections alive, so that kubectl exec and kubectl port-forward won't be interrupted. (#97083, @knight42) [SIG API Machinery and CLI]
Documentation
Official support to build kubernetes with docker-machine / remote docker is removed. This change does not affect building kubernetes with docker locally. (#97935, @adeniyistephen) [SIG Release and Testing]
Set kubelet option --volume-stats-agg-period to negative value to disable volume calculations. (#96675, @pacoxu) [SIG Node]
Bug or Regression
Clean ReplicaSet by revision instead of creation timestamp in deployment controller (#97407, @waynepeking348) [SIG Apps]
Ensure that client-go's EventBroadcaster is safe (non-racy) during shutdown. (#95664, @DirectXMan12) [SIG API Machinery]
Fix kubelet from panic after getting the wrong signal (#98200, @wzshiming) [SIG Node]
Fix repeatedly acquire the inhibit lock (#98088, @wzshiming) [SIG Node]
Fixed a bug that the kubelet cannot start on BtrfS. (#98042, @gjkim42) [SIG Node]
Fixed an issue with garbage collection failing to clean up namespaced children of an object also referenced incorrectly by cluster-scoped children (#98068, @liggitt) [SIG API Machinery and Apps]
Fixed the namespace having no effect when exposing a deployment with --dry-run=client. (#97492, @masap) [SIG CLI]
Fixing a bug where a failed node may not have the NoExecute taint set correctly (#96876, @howieyuen) [SIG Apps and Node]
The indentation of the Resource Quota block in kubectl describe namespaces output is now correct. (#97946, @dty1er) [SIG CLI]
KUBECTL_EXTERNAL_DIFF now accepts equal sign for additional parameters. (#98158, @dougsland) [SIG CLI]
Kubeadm: fix a bug where "kubeadm join" would not properly handle missing names for existing etcd members. (#97372, @ihgann) [SIG Cluster Lifecycle]
Kubelet should ignore cgroup driver check on Windows node. (#97764, @pacoxu) [SIG Node and Windows]
Make podTopologyHints protected by lock (#95111, @choury) [SIG Node]
Readjust kubelet_containers_per_pod_count bucket (#98169, @wawa0210) [SIG Instrumentation and Node]
Scores from InterPodAffinity have stronger differentiation. (#98096, @leileiwan) [SIG Scheduling]
Specifying the KUBE_TEST_REPO environment variable when e2e tests are executed will instruct the test infrastructure to load that image from a location within the specified repo, using a predefined pattern. (#93510, @smarterclayton) [SIG Testing]
Static pods will be deleted gracefully. (#98103, @gjkim42) [SIG Node]
Use network.Interface.VirtualMachine.ID to get the bound VM.
Skip standalone VMs when reconciling LoadBalancer. (#97635, @nilo19) [SIG Cloud Provider]
Other (Cleanup or Flake)
Kubeadm: change the default image repository for CI images from 'gcr.io/kubernetes-ci-images' to 'gcr.io/k8s-staging-ci-images' (#97087, @SataQiu) [SIG Cluster Lifecycle]
Migrate generic_scheduler.go and types.go to structured logging. (#98134, @tanjing2020) [SIG Scheduling]
Migrate proxy/winuserspace/proxier.go logs to structured logging (#97941, @JornShen) [SIG Network]
Migrate staging/src/k8s.io/apiserver/pkg/audit/policy/reader.go logs to structured logging. (#98252, @lala123912) [SIG API Machinery and Auth]
Migrate staging\src\k8s.io\apiserver\pkg\endpoints logs to structured logging (#98093, @lala123912) [SIG API Machinery]
Node (#96552, @pandaamanda) [SIG Apps, Cloud Provider, Node and Scheduling]
The kubectl alpha debug command was scheduled to be removed in v1.21. (#98111, @pandaamanda) [SIG CLI]
(No, really, you MUST read this before you upgrade)
Kube-proxy's IPVS proxy mode no longer sets the net.ipv4.conf.all.route_localnet sysctl parameter. Nodes upgrading will have net.ipv4.conf.all.route_localnet set to 1, but new nodes will inherit the system default (usually 0). If you relied on any behavior requiring net.ipv4.conf.all.route_localnet, you must ensure it is enabled, as kube-proxy will no longer set it automatically. This change helps to further mitigate CVE-2020-8558. (#92938, @lbernail) [SIG Network and Release]
Changes by Kind
Deprecation
Deprecate the topologyKeys field in Service. This capability will be replaced with upcoming work around Topology Aware Subsetting and Service Internal Traffic Policy. (#96736, @andrewsykim) [SIG Apps]
Kubeadm: graduate the command kubeadm alpha kubeconfig user to kubeadm kubeconfig user. The kubeadm alpha kubeconfig user command is deprecated now. (#97583, @knight42) [SIG Cluster Lifecycle]
Kubeadm: the "kubeadm alpha certs" command is removed now, please use "kubeadm certs" instead. (#97706, @knight42) [SIG Cluster Lifecycle]
Remove the deprecated metrics "scheduling_algorithm_preemption_evaluation_seconds" and "binding_duration_seconds"; use "scheduler_framework_extension_point_duration_seconds" instead. (#96447, @chendave) [SIG Cluster Lifecycle, Instrumentation, Scheduling and Testing]
The PodSecurityPolicy API is deprecated in 1.21, and will no longer be served starting in 1.25. (#97171, @deads2k) [SIG Auth and CLI]
API Change
Change the APIVersion proto name of BoundObjectRef from aPIVersion to apiVersion. (#97379, @kebe7jun) [SIG Auth]
Promote Immutable Secrets/ConfigMaps feature to Stable.
This allows setting the immutable field in a Secret or ConfigMap object to mark its contents as immutable. (#97615, @wojtek-t) [SIG Apps, Architecture, Node and Testing]
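As a minimal sketch of this feature (the object name and data key are illustrative), an immutable ConfigMap simply sets the top-level immutable field; the kubelet then stops watching it for changes:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config        # illustrative name
data:
  log-level: "info"
immutable: true           # once set, data/binaryData can no longer be changed
```

The same field exists on Secret objects; to change the contents after marking an object immutable, you must delete and recreate it.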
Feature
Add flag --lease-max-object-size and metric etcd_lease_object_counts for kube-apiserver to config and observe max objects attached to a single etcd lease. (#97480, @lingsamuel) [SIG API Machinery, Instrumentation and Scalability]
Add flag --lease-reuse-duration-seconds for kube-apiserver to config etcd lease reuse duration. (#97009, @lingsamuel) [SIG API Machinery and Scalability]
Adds the ability to pass --strict-transport-security-directives to the kube-apiserver to set the HSTS header appropriately. Be sure you understand the consequences to browsers before setting this field. (#96502, @249043822) [SIG Auth]
Kubeadm now includes CoreDNS v1.8.0. (#96429, @rajansandeep) [SIG Cluster Lifecycle]
Kubeadm: add support for certificate chain validation. When using kubeadm in external CA mode, this allows an intermediate CA to be used to sign the certificates. The intermediate CA certificate must be appended to each signed certificate for this to work correctly. (#97266, @robbiemcmichael) [SIG Cluster Lifecycle]
Kubeadm: amend the node kernel validation to treat CGROUP_PIDS, FAIR_GROUP_SCHED as required and CFS_BANDWIDTH, CGROUP_HUGETLB as optional (#96378, @neolit123) [SIG Cluster Lifecycle and Node]
The Kubernetes pause image manifest list now contains an image for Windows Server 20H2. (#97322, @claudiubelu) [SIG Windows]
The apimachinery util/net function ResolveBindAddress(), used to detect the bind address, now takes global IP addresses on loopback interfaces into consideration when:
- the host has default routes
- there are no global IPs on those interfaces.
This supports more complex network scenarios like BGP Unnumbered (RFC 5549). (#95790, @aojea) [SIG Network]
Bug or Regression
Changelog
General
Fix priority expander falling back to a random choice even though there is a higher priority option to choose
Clone kubernetes/kubernetes in update-vendor.sh shallowly, instead of fetching all revisions
Speed up binpacking by reducing the number of PreFilter calls (call once per pod instead of #pods*#nodes times)
Speed up finding unneeded nodes by 5x+ in very large clusters by reducing the number of PreFilter calls
Expose --max-nodes-total as a metric
Errors in IncreaseSize changed from type apiError to cloudProviderError
Make build-in-docker and test-in-docker work on Linux systems with SELinux enabled
Fix an error where existing nodes were not considered as destinations while finding place for pods in scale-down simulations
Remove redundant log lines and reduce severity around parsing kubeEnv
Don't treat nodes created by virtual kubelet as nodes from non-autoscaled node groups
Remove redundant logging around calculating node utilization
Add configurable --network and --rm flags for docker in Makefile
Subtract DaemonSet pods' requests from node allocatable in the denominator while computing node utilization
Include taints by condition when determining if a node is unready/still starting
Fix update-vendor.sh to work on OSX and zsh
Add best-effort eviction for DaemonSet pods while scaling down non-empty nodes
Add build support for ARM64
AliCloud
Add missing daemonsets and replicasets to ALI example cluster role
Apache CloudStack
Add support for Apache CloudStack
AWS
Regenerate list of EC2 instances
Fix pricing endpoint in AWS China Region
Azure
Add optional jitter on initial VMSS VM cache refresh, keep the refreshes spread over time
Serve from cache for the whole period of ongoing throttling
Fix unwanted VMSS VMs cache invalidations
Enforce setting the number of retries if cloud provider backoff is enabled
Don't update capacity if VMSS provisioning state is updating
Support allocatable resources overrides via VMSS tags
Add missing stable labels in template nodes
Proactively set instance status to deleting on node deletions
Cluster API
Migrate interaction with the API from using internal types to using Unstructured
Improve tests to work better with constrained resources
Add support for node autodiscovery
Add support for --cloud-config
Update group identifier to use for Cluster API annotations
Exoscale
Add support for Exoscale
GCE
Decrease the number of GCE Read Requests made while deleting nodes
Base pricing of custom instances on their instance family type
Add pricing information for missing machine types
Add pricing information for different GPU types
Ignore the new topology.gke.io/zone label when comparing groups
Add missing stable labels to template nodes
HuaweiCloud
Add auto scaling group support
Implement node group by AS
Implement getting desired instance number of node group
Implement increasing node group size
Implement TemplateNodeInfo
Implement caching instances
IONOS
Add support for IONOS
Kubemark
Skip non-kubemark nodes while computing node infos for node groups.
Magnum
Add Magnum support in the Cluster Autoscaler helm chart
Fix nil VMSS name when setting service to auto mode (#97366, @nilo19) [SIG Cloud Provider]
Fix the panic when kubelet registers if a node object already exists with no Status.Capacity or Status.Allocatable (#95269, @SataQiu) [SIG Node]
Fix the regression with slow pod termination. Before this fix, pods could take up to an additional minute to terminate. This reverts the change that ensured CNI resources are cleaned up when the pod is removed from the API server. (#97980, @SergeyKanzhelev) [SIG Node]
Fix to recover CSI volumes from certain dangling attachments (#96617, @yuga711) [SIG Apps and Storage]
Fix: azure file latency issue for metadata-heavy workloads (#97082, @andyzhangx) [SIG Cloud Provider and Storage]
Fixed FibreChannel volume plugin corrupting filesystems on detach of multipath volumes. (#97013, @jsafrane) [SIG Storage]
Fixed a bug in kubelet that will saturate CPU utilization after containerd got restarted. (#97174, @hanlins) [SIG Node]
Fixed bug in CPUManager with race on container map access (#97427, @klueska) [SIG Node]
Fixed cleanup of block devices when /var/lib/kubelet is a symlink. (#96889, @jsafrane) [SIG Storage]
GCE Internal LoadBalancer sync loop will now release the ILB IP address upon sync failure. An error in ILB forwarding rule creation will no longer leak IP addresses. (#97740, @prameshj) [SIG Cloud Provider and Network]
Ignore update pod with no new images in alwaysPullImages admission controller (#96668, @pacoxu) [SIG Apps, Auth and Node]
Kubeadm now installs version 3.4.13 of etcd when creating a cluster with v1.19 (#97244, @pacoxu) [SIG Cluster Lifecycle]
Kubeadm: avoid detection of the container runtime for commands that do not need it (#97625, @pacoxu) [SIG Cluster Lifecycle]
Kubeadm: fix a bug in the host memory detection code on 32bit Linux platforms (#97403, @abelbarrera15) [SIG Cluster Lifecycle]
Kubeadm: fix a bug where "kubeadm upgrade" commands can fail if CoreDNS v1.8.0 is installed. (#97919, @neolit123) [SIG Cluster Lifecycle]
Remove deprecated --cleanup-ipvs flag of kube-proxy, and make --cleanup flag always to flush IPVS (#97336, @maaoBit) [SIG Network]
The current version of the container image publicly exposed IP serving a /metrics endpoint to the Internet. The new version of the container image serves /metrics endpoint on a different port. (#97621, @vbannai) [SIG Cloud Provider]
Use force unmount for NFS volumes if regular mount fails after 1 minute timeout (#96844, @gnufied) [SIG Storage]
Users will see increase in time for deletion of pods and also guarantee that removal of pod from api server would mean deletion of all the resources from container runtime. (#92817, @kmala) [SIG Node]
Using exec auth plugins with kubectl no longer results in warnings about constructing many client instances from the same exec auth config. (#97857, @liggitt) [SIG API Machinery and Auth]
Warning about using a deprecated volume plugin is logged only once. (#96751, @jsafrane) [SIG Storage]
Other (Cleanup or Flake)
Bump github.com/Azure/go-autorest/autorest to v0.11.12 (#97033, @patrickshan) [SIG API Machinery, CLI, Cloud Provider and Cluster Lifecycle]
Kube-proxy: Traffic from the cluster directed to ExternalIPs is always sent directly to the Service. (#96296, @aojea) [SIG Network and Testing]
Kubeadm: fix a whitespace issue in the output of the "kubeadm join" command shown as the output of "kubeadm init" and "kubeadm token create --print-join-command" (#97413, @SataQiu) [SIG Cluster Lifecycle]
Kubeadm: improve the error messaging when the user provides an invalid discovery token CA certificate hash. (#97290, @neolit123) [SIG Cluster Lifecycle]
Migrate log messages in pkg/scheduler/{scheduler.go,factory.go} to structured logging (#97509, @aldudko) [SIG Scheduling]
Migrate proxy/iptables/proxier.go logs to structured logging (#97678, @JornShen) [SIG Network]
Migrate some scheduler log messages to structured logging (#97349, @aldudko) [SIG Scheduling]
NetworkPolicy validation framework optimizations for rapidly verifying CNI's work correctly across several pods and namespaces (#91592, @jayunit100) [SIG Network, Storage and Testing]
Official support to build kubernetes with docker-machine / remote docker is removed. This change does not affect building kubernetes with docker locally. (#97618, @jherrera123) [SIG Release and Testing]
Scheduler plugin validation now provides all errors detected instead of the first one. (#96745, @lingsamuel) [SIG Node, Scheduling and Testing]
Storage related e2e testsuite redesign & cleanup (#96573, @Jiawei0227) [SIG Storage and Testing]
The OIDC authenticator no longer waits 10 seconds before attempting to fetch the metadata required to verify tokens. (#97693, @enj) [SIG API Machinery and Auth]
The AttachVolumeLimit feature gate that is GA since v1.17 is now removed. (#96539, @ialidzhikov) [SIG Storage]
The CSINodeInfo feature gate that is GA since v1.17 is unconditionally enabled, and can no longer be specified via the --feature-gates argument. (#96561, @ialidzhikov) [SIG Apps, Auth, Scheduling, Storage and Testing]
The deprecated feature gates RotateKubeletClientCertificate, AttachVolumeLimit, VolumePVCDataSource and EvenPodsSpread are now unconditionally enabled and can no longer be specified in component invocations. (#97306, @gavinfish) [SIG Node, Scheduling and Storage]
ServiceNodeExclusion, NodeDisruptionExclusion and LegacyNodeRoleBehavior(locked to false) features have been promoted to GA.
To prevent control plane nodes from being added to load balancers automatically, users upgrading need to add the "node.kubernetes.io/exclude-from-external-load-balancers" label to control plane nodes. (#97543, @pacoxu) [SIG API Machinery, Apps, Cloud Provider and Network]
Uncategorized
Adding Brazilian Portuguese translation for kubectl (#61595, @cpanato) [SIG CLI]
error: failed to run Kubelet: failed to create kubelet:
misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
# kubectl get pods
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
In some cases the kubectl logs and kubectl run commands may return the following error in an otherwise functional cluster:
Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc65b868-glc5m/mysql: dial tcp 10.19.0.41:10250: getsockopt: no route to host
This may be due to Kubernetes using an IP that cannot communicate with other IPs on the seemingly same subnet, possibly by policy of the machine provider.
Digital Ocean assigns a public IP to eth0 as well as a private IP to be used internally as the anchor for their floating IP feature,
yet the kubelet will pick the latter as the node's InternalIP instead of the public one.
Use ip addr show to check for this scenario instead of ifconfig, because ifconfig
will not display the offending alias IP address. Alternatively, an API endpoint specific to Digital Ocean
allows querying the anchor IP from the droplet:
In cloud environments, kube-proxy can end up being scheduled onto a new node before the cloud controller manager has initialized the node addresses.
This causes kube-proxy to fail to pick up the node's IP address properly, with knock-on effects on the proxy function managing load balancers.
The following errors can be seen in kube-proxy Pods:
server.go:610] Failed to retrieve node IP: host IP unknown; known addresses: []
proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a Pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root:
kubeadm join <control-plane-host>:<control-plane-port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
[preflight] Running pre-flight checks
... (log output of join workflow) ...
Node join complete:
* Certificate signing request sent to control-plane and response
received.
* Kubelet informed of new secure connection details.
Run 'kubectl get nodes' on control-plane to see this machine join.
A few seconds later, you should notice this node in the output when you run kubectl get nodes on the control-plane node.
...
You can now join any number of control-plane nodes by running the following command on each as root:
kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07
Please note that the certificate-key gives access to cluster sensitive data, keep it secret!
As a safeguard, uploaded-certs will be deleted in two hours; If necessary, you can use kubeadm init phase upload-certs to reload certs afterward.
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866
Kubespray provides a way to verify inter-pod connectivity and DNS resolution with
Netchecker.
Netchecker ensures that the netchecker-agents pods can resolve
DNS requests and ping each other within the default namespace.
Those pods mimic the behavior of the rest of the workloads and serve as cluster health indicators.
Windows applications constitute a large portion of the services and applications that run in many organizations.
Windows containers provide a modern way to encapsulate processes and package
dependencies, making it easier to use DevOps practices and follow cloud native
patterns for Windows applications.
Kubernetes has become the de facto standard container orchestrator, and the
release of Kubernetes 1.14 includes production support for scheduling Windows
containers on Windows nodes in a Kubernetes cluster, enabling a vast ecosystem
of Windows applications to leverage the power of Kubernetes.
Organizations with investments in Windows-based applications and Linux-based applications
don't have to look for separate orchestrators to manage their workloads,
leading to increased operational efficiencies across their deployments, regardless of operating system.
Windows containers in Kubernetes
To enable the orchestration of Windows containers in Kubernetes, include Windows
nodes in your existing Linux cluster. Scheduling Windows containers in Pods on
Kubernetes is similar to scheduling Linux-based containers.
In order to run Windows containers, your Kubernetes cluster must include multiple
operating systems: control plane nodes run Linux, while worker nodes run Windows
or Linux depending on your workload needs.
Windows Server 2019 is the only Windows operating system supported, enabling
Kubernetes Node on Windows
(including kubelet, container runtime,
and kube-proxy). For a detailed explanation of Windows distribution channels, see the
Microsoft documentation.
Note: The Kubernetes control plane, including the master components,
continues to run on Linux.
There are no plans to have a Windows-only Kubernetes cluster.
Note: In this document, when we talk about Windows containers we mean Windows
containers with process isolation. Windows containers with Hyper-V isolation
are planned for a future release.
Supported functionality and limitations
Supported functionality
Windows OS version support
Refer to the following table for Windows operating systems supported in Kubernetes.
A single heterogeneous Kubernetes cluster can have both Windows and Linux worker nodes.
Windows containers have to be scheduled on Windows nodes, and Linux containers on Linux nodes.
A Pod is the basic building block of Kubernetes: the smallest and simplest unit in
the Kubernetes object model that you create or deploy. You may not deploy Windows and
Linux containers in the same Pod.
All containers in a Pod are scheduled onto a single Node, where each Node represents a
specific platform and architecture. The following Pod capabilities, properties, and events are supported with Windows containers:
Windows supports five different networking drivers/modes: L2bridge, L2tunnel,
Overlay, Transparent, and NAT.
In a heterogeneous cluster with Windows and Linux worker nodes, you need to select a
networking solution that works on both Windows and Linux. The following out-of-tree
plugins are supported on Windows, with recommendations on when to use each
CNI plugin:
Network driver | Description | Container packet modifications | Network plugins | Network plugin characteristics
L2bridge | Containers are attached to an external vSwitch. Containers are attached to the underlay network, although the physical network doesn't need to learn the container MAC addresses because they are rewritten on ingress/egress. | The MAC address is rewritten to the host MAC address, and the IP address may be rewritten to the host IP address using the HNS OutboundNAT policy. |  |
win-overlay should be used when virtual container networks are desired to be
isolated from the underlay of hosts (e.g. for security reasons).
If you are restricted on IP addresses in your datacenter, an overlay network allows
you to reuse IP addresses across different overlay networks (each with a different VNID tag).
This option requires the
[KB4489899](https://support.microsoft.com/help/4489899) patch on Windows Server 2019.
Using IPv6 with Kubernetes on Windows requires
Windows Server, version 2004 (kernel version 10.0.19041.610) or later.
Overlay (VXLAN) networks on Windows do not support dual-stack networking today.
Limitations
Windows is only supported as a worker node in the Kubernetes architecture and component matrix.
This means that a Kubernetes cluster must always include Linux master nodes, zero or more Linux
worker nodes, and zero or more Windows worker nodes.
Resource handling
Linux cgroups are used as a pod boundary for resource control in Linux.
Containers are created within that boundary for network, process, and file system isolation.
The cgroups APIs can be used to gather CPU, I/O, and memory statistics.
In contrast, Windows uses a Job object per container with a system namespace filter
to contain all processes in a container and provide logical isolation from the host.
There is no way to run a Windows container without the namespace filtering in place.
This also means that system privileges cannot be asserted in the context of the host,
and thus privileged containers are not available on Windows.
Containers cannot assume an identity from the host because the Security Account
Manager (SAM) is separate.
Resource reservations
Memory reservations
Windows does not have an out-of-memory process killer as Linux does.
Windows always treats all user-mode memory allocations as virtual, and pagefiles
are mandatory. The net effect of this difference is that Windows won't reach
out-of-memory conditions the same way Linux does: processes page to disk instead
of being subject to out-of-memory (OOM) termination.
If memory is over-provisioned and all physical memory is exhausted, then paging can slow down performance.
To account for CPU usage by Windows, Docker, and other Kubernetes host processes, it is
recommended to reserve a percentage of CPU so they are able to respond to events. This value
needs to be scaled based on the number of CPU cores available on the Windows node. To
determine this percentage, a user should identify the maximum pod density for each of their
nodes and monitor the CPU usage of the system services, then choose a value that meets their workload needs.
CPU usage on the node (outside of containers) can be accounted for with the kubelet flags
--kube-reserved and/or --system-reserved, which may also limit CPU usage to a reasonable range.
Doing so reduces the node's allocatable CPU
(NodeAllocatable).
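As a sketch of the reservation described above, the same settings can also be expressed in a KubeletConfiguration file instead of command-line flags (the CPU quantities below are illustrative placeholders, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserve CPU for Kubernetes host processes (kubelet, container runtime)
kubeReserved:
  cpu: "500m"       # illustrative value; scale with core count and pod density
# Reserve CPU for OS system services
systemReserved:
  cpu: "500m"       # illustrative value
```

Both reservations are subtracted from the node's capacity when computing NodeAllocatable.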
Feature restrictions
Termination grace period: not implemented
Single file mapping: to be implemented with CRI-ContainerD
Termination message: to be implemented with CRI-ContainerD
Privileged containers: not currently supported in Windows containers
Huge pages: not currently supported in Windows containers
The existing node problem detector is Linux-only and requires privileged containers.
In general, we don't expect it to be used on Windows, because Windows does not support privileged containers.
On Windows nodes there is an extra flag to set the priority of the kubelet process, called
--windows-priorityclass. This flag allows the kubelet process to get more CPU time slices
when compared to other processes running on the Windows host.
More information on the allowable values and their meaning is available at
Windows Priority Classes.
For the kubelet to always have enough CPU cycles, it is recommended to set this flag to
ABOVE_NORMAL_PRIORITY_CLASS or above.
Storage
Windows has a layered filesystem driver to mount container layers and create a copy
filesystem based on NTFS. All file paths in the container are resolved only within the context of that container.
The Windows host networking service and virtual switch implement namespacing and can
create virtual NICs as needed for a pod or container. However, many configurations such
as DNS, routes, and metrics are stored in the Windows registry database rather than in
/etc/... files as they are on Linux.
The Windows registry for the container is separate from that of the host, so concepts
like mapping /etc/resolv.conf
from the host into a container don't have the same effect as they would on Linux.
These must be configured using Windows APIs run in the context of that container.
Therefore CNI implementations need to call the HNS instead of relying on file mappings
to pass network details into the pod or container.
The following networking functionality is not supported on Windows nodes:
Host networking mode is not available for Windows pods;
Local NodePort access from the node itself fails (it works for other nodes or external clients);
accessing service VIPs from nodes will be available in a future release of Windows Server;
A maximum of 64 backend pods (or unique destination IP addresses) is supported per service;
Overlay networking support in kube-proxy is a beta feature; additionally, it requires the
KB4482887 patch installed on Windows Server 2019;
Local Traffic Policy in non-DSR (destination-address-preserving) mode;
Windows containers connected to overlay networks do not support communicating over the IPv6 stack.
There's significant work required for this network driver to consume IPv6 addresses,
as well as kubelet, kube-proxy, and CNI plugin changes on the Kubernetes side.
The ClusterFirstWithHostNet DNS setting is not supported. Windows treats all names with a "."
as fully qualified domain names (FQDNs) and skips partially qualified domain name (PQDN) resolution.
On Linux, you have a DNS suffix list which is used when attempting PQDN resolution.
On Windows, we only have one DNS suffix: the DNS
suffix associated with that pod's namespace (for example, mydns.svc.cluster.local).
Windows can resolve FQDNs, as well as services or names that happen to be resolvable with just that suffix.
For example, a pod spawned in the default namespace will have the DNS suffix
default.svc.cluster.local. In a Windows pod, you can resolve both
kubernetes.default.svc.cluster.local and kubernetes, but not the in-betweens,
like kubernetes.default or kubernetes.default.svc.
On Windows, there are multiple DNS resolvers that can be used. As these come with
slightly different behaviors, using the Resolve-DNSName utility for name query resolution is recommended.
IPv6
Kubernetes on Windows does not support single-stack "IPv6-only" networking.
However, dual-stack IPv4/IPv6 networking for pods and nodes with single-family services is supported.
See IPv4/IPv6 dual-stack networking for more details.
Session affinity
Setting the maximum session sticky time for Windows services using
service.spec.sessionAffinityConfig.clientIP.timeoutSeconds is not supported.
Security
Secrets are written in clear text on the node's volume (as compared to tmpfs/in-memory on Linux).
This means customers have to do two things:
Signal: Windows interactive apps handle termination differently and can implement one or more of the following:
A UI thread handles well-defined messages, including WM_CLOSE
Console apps handle Ctrl-C or Ctrl-Break using a Control Handler
Services register a Service Control Handler function that can accept SERVICE_CONTROL_STOP control codes
Exit codes follow the same convention, where 0 is success and non-zero is failure.
The specific error codes may differ across Windows and Linux. However, exit codes
passed from the Kubernetes components (kubelet, kube-proxy) are unchanged.
v1.Container.ResourceRequirements.limits.cpu and
v1.Container.ResourceRequirements.limits.memory - Windows
doesn't use hard limits for CPU allocations. Instead, a share system is used.
The existing fields based on millicores are scaled into relative shares that are
followed by the Windows scheduler.
See kuberuntime/helpers_windows.go and
the resource controls section of the Microsoft docs.
Huge pages are not implemented in the Windows container runtime and are not available.
They require asserting a user privilege
that's not configurable at the container level.
v1.PodSecurityContext.sysctls - these are part of the Linux sysctl interface;
there's no Windows equivalent.
OS version restrictions
Windows has strict compatibility rules, where the host OS version must match the
container base image OS version.
Only Windows containers with a container operating system of Windows Server 2019 are supported.
Hyper-V isolation of containers, enabling some backward compatibility of Windows
container image versions, is planned for a future release.
In a Kubernetes Pod, an infrastructure or "pause" container is first created to host the
container endpoint. Containers that belong to the same pod, including infrastructure and
worker containers, share a common network namespace and endpoint (same IP and port space).
Pause containers are needed to accommodate worker containers crashing or restarting
without losing any of the networking configuration.
Note: Due to a current limitation of the Windows networking stack on the platform, a Windows container host cannot access the IP of a service scheduled on it. Only Windows pods are able to access service IPs.
Observability
Capturing logs from workloads
Logs are an important element of observability; they enable users to gain insights
into the operational aspect of workloads and are a key element to troubleshooting issues.
Because Windows containers, and workloads inside Windows containers, behave differently
from Linux containers, users had a hard time collecting logs, limiting operational visibility.
Windows workloads, for example, are usually configured to log to Event Tracing for Windows
(ETW) or push entries to the application event log.
LogMonitor,
an open source tool by Microsoft, is the recommended way to monitor configured log sources
inside a Windows container.
LogMonitor supports monitoring event logs, ETW providers, and custom application logs,
piping them to STDOUT for consumption by commands such as kubectl logs <pod>.
Starting with Kubernetes v1.16, Windows containers can be configured to run their
entrypoints and processes with different usernames than the image defaults.
The way this is achieved is a bit different from the way it is done for Linux containers.
You can learn more about it
here.
Managing workload identity with Group Managed Service Accounts
Starting with Kubernetes v1.14, Windows container workloads can be configured to use Group Managed Service Accounts (GMSA).
Group Managed Service Accounts are a specific type of Active Directory account that provides automatic password management,
simplified service principal name (SPN) management, and the ability to delegate the management to other administrators across multiple servers.
Containers configured with a GMSA can access external Active Directory domain resources while carrying the identity configured with the GMSA.
Learn more about configuring and using GMSA for
Windows containers here.
Taints and tolerations
Users today need to use some combination of taints and node selectors in order to keep
Linux and Windows workloads on their respective OS-specific nodes.
This likely imposes a burden only on Windows users.
The recommended approach is outlined below, with one of its main goals being that this
approach should not break compatibility for existing Linux workloads.
Ensuring OS-specific workloads land on the appropriate container host
Users can ensure Windows containers are scheduled on the appropriate host using taints and tolerations. All Kubernetes nodes today have the following default labels:
kubernetes.io/os = [windows|linux]
kubernetes.io/arch = [amd64|arm64|...]
If a Pod specification does not specify a nodeSelector such as "kubernetes.io/os": windows, the Pod
may be scheduled on any host, Windows or Linux.
This can be problematic, since a Windows container can only run on Windows and a Linux container can only run on Linux.
The best practice is to use a nodeSelector.
However, we understand that in many cases users have a pre-existing large number of Linux container deployments,
as well as an ecosystem of off-the-shelf configurations, such as community Helm charts, and programmatic Pod generation cases, such as with operators.
In those situations, you may be hesitant to make the configuration change to add nodeSelectors. The alternative is to use taints.
Because the kubelet can set taints during registration, it could easily be modified to automatically add a taint when running on Windows only.
For example: --register-with-taints='os=windows:NoSchedule'
By adding a taint to all Windows nodes, nothing will be scheduled on them (that includes existing Linux Pods).
In order for a Windows Pod to be scheduled on a Windows node, it would need both the
nodeSelector to choose Windows, and a matching appropriate toleration.
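The combination described above can be sketched in a Pod spec like the following (the Pod name is illustrative, and the toleration matches the example os=windows:NoSchedule taint):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: iis-example              # illustrative name
spec:
  nodeSelector:                  # steer the Pod to Windows nodes only
    kubernetes.io/os: windows
  tolerations:                   # tolerate the example taint added at kubelet registration
  - key: "os"
    operator: "Equal"
    value: "windows"
    effect: "NoSchedule"
  containers:
  - name: iis
    image: mcr.microsoft.com/windows/servercore/iis
```

Without the toleration the Pod is repelled by the tainted Windows nodes; without the nodeSelector it could still land on a Linux node.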
For example, when using a managed load balancer, you can configure the load balancer to
send traffic that originates from the kubelets and Pods in failure zone A,
and direct that traffic only to the control plane hosts that are also in zone A.
If a single control-plane host, or endpoint failure zone A, goes offline, that means that all the
control-plane traffic for nodes in zone A is now being sent between zones.
Running multiple control plane hosts in each zone makes that outcome less likely.
[1]: a different IP or DNS name used to connect to the cluster
(such as the stable IP or DNS name used by kubeadm for load balancing:
kubernetes, kubernetes.default, kubernetes.default.svc,
kubernetes.default.svc.cluster, kubernetes.default.svc.cluster.local).
The name Kubernetes originates from Greek, meaning "helmsman" or "pilot". The abbreviation k8s comes from the eight letters between the "k" and the "s".
Google open-sourced the Kubernetes project in 2014. Kubernetes builds on more than
a decade of Google's experience running production workloads at scale,
combined with best-of-breed ideas and practices from the community.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
Error from server (BadRequest): Unable to find "ingresses" that match label selector "", field selector "foo.bar=baz": "foo.bar" is not a known field selector: only "metadata.name", "metadata.namespace"
This hook is called immediately before a container is terminated due to an API request
or a management event (such as a liveness/startup probe failure, preemption, resource contention and others).
A call to the preStop hook fails if the container is already in a terminated or completed state.
The hook must complete execution before the TERM signal to stop the container is sent.
The Pod's termination grace period countdown begins before the preStop hook is executed,
so regardless of the outcome of the handler, the container will eventually terminate
within the Pod's termination grace period.
No parameters are passed to the handler.
Once your application is running, you might want to make it available over the internet as a
Service,
or, for a web application,
expose it to the internet using an Ingress resource.
3.4.1 - Pods
Pods are the smallest deployable units of computing that you can create and manage in Kubernetes.
A Pod (as in a pod of whales or a pea pod) is a group of one or more
containers,
with shared storage and network resources, and a specification for how to run the containers.
A Pod's contents are always co-located and co-scheduled, and run in a shared context.
A Pod models an application-specific "logical host": it contains one or more application
containers which are relatively tightly coupled.
In non-cloud contexts, applications executed on the same physical or virtual machine
are analogous to cloud applications executed on the same logical host.
As well as application containers, a Pod can contain
init containers that run during Pod startup.
You can also inject ephemeral containers
for debugging if your cluster offers this.
Pods that run a single container. The "one-container-per-Pod" model is the most common Kubernetes use case;
in this case, you can think of a Pod as a wrapper around a single container; Kubernetes manages Pods rather than managing the containers directly.
Pods that run multiple containers that need to work together.
A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources.
These co-located containers form a single cohesive unit of service: one container serving files from a shared volume to the public,
while a separate "sidecar" container refreshes or updates those files.
The Pod wraps these containers and storage resources together as a single manageable entity.
Note: Grouping multiple co-located and co-managed containers in a single Pod is a relatively advanced use case.
You should use this pattern only in specific instances in which your containers are tightly coupled.
每个 Pod 都旨在运行给定应用程序的单个实例。如果希望横向扩展应用程序(例如,运行多个实例
以提供更多的资源),则应该使用多个 Pod,每个实例使用一个 Pod。
在 Kubernetes 中,这通常被称为 副本(Replication)。
通常使用一种工作负载资源及其控制器
来创建和管理一组 Pod 副本。
参见 Pod 和控制器以了解 Kubernetes
如何使用工作负载资源及其控制器以实现应用的扩缩和自动修复。
How Pods manage multiple containers
Pods are designed to support multiple cooperating processes (as containers) that form a cohesive unit of service.
The containers in a Pod are automatically co-located and co-scheduled on the same physical or virtual machine in the cluster.
The containers can share resources and dependencies, communicate with one another, and coordinate when and how they are terminated.
For example, you might have a container that acts as a web server for files in a shared volume, and a separate
"sidecar" container that updates those files from a remote source, as in the following diagram:
Pod updates may not change fields other than spec.containers[*].image, spec.initContainers[*].image,
spec.activeDeadlineSeconds, or spec.tolerations.
For spec.tolerations, you can only add new entries.
When updating the spec.activeDeadlineSeconds field, two types of updates are allowed:
setting the field to a positive number if it is currently unset;
or, if the field is already set to a positive number, decreasing it to a smaller non-negative integer.
Resource sharing and communication
Pods enable data sharing and communication among their constituent containers.
Storage in Pods
A Pod can specify a set of shared storage volumes.
All containers in the Pod can access the shared volumes, allowing those containers to share data.
Volumes also allow persistent data in a Pod to survive if one of its containers needs to be restarted.
See Volumes for more information on how Kubernetes implements shared storage
and makes it available to Pods.
Pod networking
Each Pod is assigned a unique IP address for each address family.
Every container in a Pod shares the network namespace, including the IP address and network ports.
Containers inside a Pod can communicate with one another using localhost.
When containers in a Pod communicate with entities outside the Pod, they must coordinate how
they use the shared network resources (such as ports).
Within a Pod, containers share an IP address and port space, and can find each other via localhost.
They can also communicate with each other using standard inter-process communication mechanisms such as SystemV semaphores or POSIX shared memory.
Containers in different Pods have distinct IP addresses and cannot communicate by IPC without
special configuration.
Containers that want to interact with a container running in a different Pod can use IP networking to communicate.
Containers within a Pod see the system hostname as being the same as the configured name of the Pod.
The Networking section has more information about this.
Privileged mode for containers
Any container in a Pod can enable privileged mode, using the
privileged flag in the security context of the container spec.
This is useful for containers that want to use operating system administrative capabilities,
such as manipulating the network stack or accessing hardware devices.
Processes within a privileged container get almost the same privileges as processes outside a container.
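As a minimal sketch of the setting described above (the Pod name, image, and command are illustrative assumptions, not taken from the surrounding text), a privileged container is declared like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: privileged-demo    # hypothetical name for illustration
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true     # grants the container near-host-level privileges
```

Use privileged mode sparingly; most workloads need only specific Linux capabilities rather than full privileges.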
Because Pods represent processes running on nodes in the cluster, it is important to allow those
processes to gracefully terminate when they are no longer needed. They generally should not be
abruptly killed with a KILL signal, leaving them no chance to clean up.
The design aim is for you to be able to request deletion and know when processes terminate,
but also to be able to ensure that deletes eventually complete. When you request deletion of a Pod,
the cluster records and tracks the intended graceful termination period rather than forcefully
killing the Pod straight away. With that forceful shutdown tracking in place, the kubelet attempts
a graceful shutdown of the Pod.
Typically, the container runtime sends a TERM signal to the main process in each container.
Many container runtimes respect the STOPSIGNAL value defined in the container image and send that signal instead of TERM.
Once the grace period has expired, the container runtime sends a KILL signal to any remaining processes,
and the Pod is then deleted from the API server.
If the kubelet or the container runtime's management service is restarted while waiting for processes to terminate,
the cluster retries from the start, granting the Pod its full graceful termination period again.
An example flow:
You use the kubectl tool to manually delete a specific Pod, with the default grace period (30 seconds).
The Pod object in the API server is updated with the final deadline beyond which the Pod is
considered "dead", computed from the grace period.
If you use kubectl describe to check on the Pod you are deleting, it shows up as
"Terminating".
On the node where the Pod is running: as soon as the kubelet sees that the Pod
has been marked as terminating (a graceful shutdown duration has been set), the kubelet begins the local Pod shutdown process.
If one of the Pod's containers defines a preStop hook,
the kubelet runs that hook inside the container. If the preStop hook is still running when the
grace period expires, the kubelet requests a one-off, 2-second extension of the Pod's grace period.
Note: The containers in a Pod receive the TERM signal at different times and in an arbitrary order.
If the order of shutdowns matters, consider using a preStop hook to coordinate.
At the same time as the kubelet starts the graceful shutdown, the control plane removes the Pod
from the endpoint lists (and EndpointSlice lists, if those are enabled) of every
Service that selects the Pod through a
selector.
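The preStop coordination mentioned above can be sketched as follows (the Pod name, image, and drain command are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prestop-demo         # hypothetical name for illustration
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: web
    image: nginx:1.14.2
    lifecycle:
      preStop:
        exec:
          # runs before TERM is sent; must finish within the grace period
          command: ["/bin/sh", "-c", "sleep 5 && nginx -s quit"]
```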
ReplicaSets and other workload resources no longer treat a shutting-down Pod as a valid, in-service replica.
Pods that shut down slowly also cannot continue to serve traffic, because load balancers
(such as the service proxy) remove the Pod from the endpoint lists as soon as the termination
grace period begins.
An application owner can create a PodDisruptionBudget (PDB) object for each application.
A PDB limits the number of Pods of a replicated application that can be down simultaneously due to voluntary disruptions.
For example, a quorum-based application would like to ensure that the number of running replicas never drops below the number needed for a quorum.
A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
Cluster managers and hosting providers should use tools that respect PodDisruptionBudgets
(by calling the Eviction API)
instead of directly deleting Pods or Deployments.
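A minimal PDB sketch (the name and selector label are illustrative assumptions; use apiVersion policy/v1beta1 on clusters older than v1.21):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name for illustration
spec:
  minAvailable: 2          # never voluntarily evict below 2 ready Pods
  selector:
    matchLabels:
      app: web             # assumed label on the protected Pods
```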
Name: nginx-deployment
Namespace: default
CreationTimestamp: Thu, 30 Nov 2017 10:56:25 +0000
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=2
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-1564180365 (3/3 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 2m deployment-controller Scaled up replica set nginx-deployment-2035384211 to 3
Normal ScalingReplicaSet 24s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 1
Normal ScalingReplicaSet 22s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 2
Normal ScalingReplicaSet 22s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 2
Normal ScalingReplicaSet 19s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 1
Normal ScalingReplicaSet 19s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 3
Normal ScalingReplicaSet 14s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 0
Name: nginx-deployment
Namespace: default
CreationTimestamp: Tue, 15 Mar 2016 14:48:04 -0700
Labels: app=nginx
Selector: app=nginx
Replicas: 3 desired | 1 updated | 4 total | 3 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.91
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
OldReplicaSets: nginx-deployment-1564180365 (3/3 replicas created)
NewReplicaSet: nginx-deployment-3066724191 (1/1 replicas created)
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1{deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-2035384211 to 3
22s 22s 1{deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 1
22s 22s 1{deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 2
22s 22s 1{deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 2
21s 21s 1{deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 1
21s 21s 1{deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 3
13s 13s 1{deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 0
13s 13s 1{deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-3066724191 to 1
To fix this, you need to roll back to a previous revision of the Deployment that is stable.
Checking the rollout history of a Deployment
Follow the steps below to check the rollout history:
First, check the revisions of this Deployment:
kubectl rollout history deployment.v1.apps/nginx-deployment
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3         3         3            3           30m
Get the description of the Deployment:
kubectl describe deployment nginx-deployment
The output is similar to this:
Name: nginx-deployment
Namespace: default
CreationTimestamp: Sun, 02 Sep 2018 18:17:55 -0500
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=4
kubernetes.io/change-cause=kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1 --record=true
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-c4747d96c (3/3 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 12m deployment-controller Scaled up replica set nginx-deployment-75675f5897 to 3
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 1
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 2
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 2
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 1
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 3
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 0
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-595696685f to 1
Normal DeploymentRollback 15s deployment-controller Rolled back deployment "nginx-deployment" to revision 2
Normal ScalingReplicaSet 15s deployment-controller Scaled down replica set nginx-deployment-595696685f to 0
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time.
As such, it is often used to guarantee the availability of a specified number of identical Pods.
How a ReplicaSet works
A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods
it can acquire, a number of replicas indicating how many Pods it should be maintaining, and a
Pod template specifying the data of new Pods it should create to meet the replica count.
A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the
desired number. When a ReplicaSet needs to create new Pods, it uses its Pod template.
A ReplicaSet is linked to its Pods via the Pods'
metadata.ownerReferences
field, which specifies what resource owns the current object.
All Pods acquired by a ReplicaSet carry their owning ReplicaSet's identifying information
in their ownerReferences field. It is through this link that the ReplicaSet knows the state
of the Pods it is maintaining and plans accordingly.
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no
OwnerReference, or whose OwnerReference is not a
controller, and it matches
a ReplicaSet's selector, it will be immediately acquired by that ReplicaSet.
When to use a ReplicaSet
A ReplicaSet ensures that a specified number of Pod replicas are running at any given time.
However, a Deployment is a higher-level concept that manages ReplicaSets and provides
declarative updates to Pods, along with many other useful features.
Therefore, we recommend using Deployments instead of directly using ReplicaSets, unless
you require custom update orchestration or don't require updates at all.
```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: gcr.io/google_samples/gb-frontend:v3
```
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi
```
In the above example:
A headless Service, named nginx, is used to control the network domain.
The StatefulSet, named web, has a spec that indicates that 3 replicas of the nginx container will be launched in unique Pods.
StatefulSet Pods have a unique identity comprised of an ordinal, a stable network identity, and stable storage.
The identity sticks to the Pod, regardless of which node it is scheduled on.
Ordinal index
For a StatefulSet with N replicas, each Pod in the StatefulSet is assigned an integer ordinal,
from 0 through N-1, that is unique across the set.
Stable network ID
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod.
The pattern for the constructed hostname is $(statefulset name)-$(ordinal).
The example above will create three Pods named web-0, web-1, and web-2.
A StatefulSet can use a headless Service
to control the domain of its Pods. The domain managed by this Service takes the form
$(service name).$(namespace).svc.cluster.local, where cluster.local is the cluster domain.
As each Pod is created, it gets a matching DNS subdomain, taking the form
$(pod name).$(governing service domain), where the governing service is defined by the serviceName field on the StatefulSet.
Depending on how DNS is configured in your cluster, you may not be able to look up the DNS name of a newly-started Pod immediately.
This can happen when other clients in the cluster send queries for the Pod's hostname before the Pod is created.
Negative caching (common in DNS) means that the results of previous failed lookups are remembered
and reused, even after the Pod is running, for at least a few seconds.
If you need to discover Pods promptly after they are created, you have a few options:
Query the Kubernetes API directly (for example, using a watch) rather than relying on DNS lookups.
Decrease the caching time in your Kubernetes DNS provider (typically this means editing the ConfigMap for CoreDNS, which currently caches for 30 seconds).
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # this toleration is to have the daemonset runnable on master nodes
      # remove it if your masters can't run pods
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
```
It is possible to create Pods by writing files to a designated directory watched by the kubelet.
Such Pods are called static Pods.
Unlike DaemonSets, static Pods cannot be managed with kubectl or other Kubernetes API clients.
Static Pods do not depend on the API server, which makes them useful in cluster bootstrapping cases.
Also, static Pods may be deprecated in the future.
Deployments
DaemonSets are similar to Deployments in that
they both create Pods, and those Pods run processes that are not expected to terminate (for example, web servers and storage servers).
Use a Deployment for stateless services, such as front ends,
where scaling the number of replicas up and down and rolling out updates matter more than controlling exactly which host a Pod runs on.
Use a DaemonSet when it is important that a copy of a Pod always runs on all or certain hosts,
and when it needs to start before other Pods.
Indexed: the Pods of the Job get an associated completion index from 0 to .spec.completions-1,
stored in the annotation batch.kubernetes.io/job-completion-index.
The Job is considered complete when there is one successfully completed Pod for each index.
For more information about how to use this mode, see
Indexed Job for Parallel Processing with Static Work Assignment.
Note that, although it is rare, more than one Pod may be started for the same index;
in that case, only one of them counts towards the completion count.
Handling Pod and container failures
A container in a Pod may fail for a number of reasons, such as because the process in it exited
with a non-zero exit code, or because the container was killed for exceeding a memory limit.
If this happens, and .spec.template.spec.restartPolicy = "OnFailure",
the Pod stays on the node, but the container is re-run.
Therefore, your program needs to handle the case when it is restarted locally, or else specify
.spec.template.spec.restartPolicy = "Never".
See Pod Lifecycle for more information on
restartPolicy.
An entire Pod can also fail, for a number of reasons,
such as when the Pod is kicked off the node while starting (the node is upgraded, rebooted, deleted, and so on),
or when a container of the Pod fails while .spec.template.spec.restartPolicy = "Never".
When a Pod fails, the Job controller starts a new Pod.
This means that your application needs to handle the case when it is restarted in a new Pod.
In particular, it needs to handle temporary files, locks, incomplete output, and the like caused by previous runs.
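A minimal Job sketch using the restart policy discussed above (the Job name, image, and command are illustrative assumptions):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-demo              # hypothetical name for illustration
spec:
  backoffLimit: 4            # give up after 4 failed Pods
  template:
    spec:
      # "Never": a failing container causes a replacement Pod, not a local restart
      restartPolicy: Never
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(100)"]
```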
Name: myjob
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 12m job-controller Created pod: myjob-hlrpl
Normal SuccessfulDelete 11m job-controller Deleted pod: myjob-hlrpl
Normal Suspended 11m job-controller Job suspended
Normal SuccessfulCreate 3s job-controller Created pod: myjob-jvb44
Normal Resumed 3s job-controller Job resumed
If there are external IPs that route to one or more cluster nodes, Kubernetes Services can be exposed on those externalIPs.
Traffic that enters the cluster with the external IP (as destination IP) on the Service port
will be routed to one of the Service endpoints.
externalIPs are not managed by Kubernetes; they are the responsibility of the cluster administrator.
In the Service spec, externalIPs can be specified together with any of the ServiceTypes.
In the example above, my-service can be accessed by clients at "80.11.12.10:80" (externalIP:port).
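The example Service referenced above is not included in this excerpt; a minimal sketch matching the described behavior (the selector and target port are assumptions) would be:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app            # assumed Pod label
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376       # assumed container port
  externalIPs:
  - 80.11.12.10            # must route to a node; managed outside Kubernetes
```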
Using the userspace proxy for VIPs works at small to medium scale, but does not scale to very large clusters with thousands of Services.
See the original design proposal for more details.
Using the userspace proxy obscures the source IP address of packets accessing a Service,
which makes some kinds of firewalling impossible.
The iptables proxy does not obscure in-cluster source IP addresses, but it does still require
client requests to come through a load balancer or node port.
The Type field is designed as nested functionality: each level adds to the previous one.
This is not strictly required on all cloud providers (for example, GCE does not need to allocate
a NodePort to make a LoadBalancer work, but AWS does), but the current API requires it.
Virtual IP implementation
The preceding information should be sufficient for many people who just want to use Services.
However, a lot is going on behind the scenes that may be worth understanding.
Avoiding collisions
One of the primary philosophies of Kubernetes is that users should not be exposed to situations that could cause their actions to fail through no fault of their own.
For the design of the Service resource, this means not making users choose their own port number if that choice might collide with someone else's.
That would be an isolation failure.
In order to let users choose a port number for their Services, we must ensure that no two Services can collide.
Kubernetes does this by allocating each Service its own IP address.
To ensure each Service receives a unique IP, an internal allocator atomically updates a
global allocation map in etcd
prior to creating each Service.
The map object must exist in the registry for Services to get IP address assignments;
otherwise, creation fails with a message indicating that an IP address could not be allocated.
In the control plane, a background controller is responsible for creating that map
(this is needed to support migration from older versions of Kubernetes that used in-memory locking).
Kubernetes also uses controllers to check for invalid assignments (for example, due to administrator intervention)
and to clean up allocated IP addresses that are no longer used by any Service.
Service IP addresses
Unlike Pod IP addresses, which actually route to a fixed destination, Service IPs are not
actually answered by a single host.
Instead, iptables (the packet processing logic in Linux) is used to define
virtual IP addresses (VIPs) that are transparently redirected as needed.
When clients connect to the VIP, their traffic is automatically transported to an appropriate endpoint.
The environment variables and DNS for Services are actually populated in terms of the Service's virtual IP address and port.
Kubernetes creates DNS records for Services and Pods.
You can contact Services with consistent DNS names instead of IP addresses.
Introduction
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures the kubelets to tell
individual containers to use the DNS Service's IP to resolve DNS names.
Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name.
By default, a client Pod's DNS search list includes the Pod's own namespace and the cluster's default domain.
Namespaces of Services
A DNS query may return different results depending on the namespace of the Pod making it.
DNS queries that do not specify a namespace are limited to the Pod's namespace.
To access Services in other namespaces, specify the namespace in the DNS query.
For example, consider a Pod in a test namespace and a Service named
data in the prod namespace.
A query for data returns no results, because it uses the Pod's namespace, test.
A query for data.prod returns the intended result, because the query specifies the namespace.
DNS queries may be expanded using the Pod's /etc/resolv.conf, which the kubelet
generates for each Pod. For example, a query for data may be expanded to data.test.cluster.local.
The values of the search option are used to expand queries. To learn more about DNS queries, see
the resolv.conf manual page.
In summary, a Pod in the test namespace can successfully resolve either data.prod or
data.prod.cluster.local.
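As a sketch of what the kubelet-generated /etc/resolv.conf for a Pod in the test namespace might contain (the nameserver IP and cluster domain are assumptions, since they vary per cluster):

```
nameserver 10.96.0.10
search test.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```

The search entries are what turn a short query like data into data.test.svc.cluster.local and its fallbacks.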
DNS records
What objects get DNS records?
Services
Pods
The following sections detail the supported DNS record types and the supported layouts.
Any other layouts, names, or queries that happen to work should be considered implementation details
and are subject to change without warning.
For the most up-to-date specification, see
Kubernetes DNS-Based Service Discovery.
Services
A/AAAA records
"Normal" (not headless) Services are assigned a DNS A or AAAA record, depending on the IP family
of the Service, for a name of the form my-svc.my-namespace.svc.cluster-domain.example.
This resolves to the cluster IP of the Service.
"Headless" Services (those without a cluster IP) are also assigned a DNS A or AAAA record,
depending on the IP family of the Service, for a name of the form
my-svc.my-namespace.svc.cluster-domain.example.
Unlike normal Services, this name resolves to the set of IPs of the Pods selected by the Service.
Clients are expected to consume the whole set, or to use standard round-robin selection from the set.
SRV records
Kubernetes creates SRV records for named ports that are part of normal or headless Services.
For each named port, the SRV record has the form _my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster-domain.example.
For a regular Service, this resolves to the port number and the domain name: my-svc.my-namespace.svc.cluster-domain.example.
For a headless Service, this resolves to multiple answers, one for each Pod backing the Service,
each containing the Pod's port number and a domain name of the form auto-generated-name.my-svc.my-namespace.svc.cluster-domain.example.
The cluster DNS server also returns an A or AAAA record for a Pod's fully qualified hostname
if a headless Service is in the same namespace as the Pod and has the same name as the Pod's subdomain.
For example, given a Pod with the hostname "busybox-1" and the subdomain
"default-subdomain", and a headless Service named "default-subdomain" in the same namespace,
the Pod will see its own FQDN as
"busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example".
DNS serves an A or AAAA record at that name, pointing to the Pod's IP.
Both "busybox1" and "busybox2" Pods would have their own A or AAAA records.
The Endpoints object can specify the hostname for any endpoint address, along with its IP.
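The busybox example above can be sketched as follows (the namespace is omitted and the image, label, and command are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: default-subdomain    # must match the Pods' subdomain
spec:
  clusterIP: None            # headless
  selector:
    name: busybox            # assumed Pod label
---
apiVersion: v1
kind: Pod
metadata:
  name: busybox1
  labels:
    name: busybox
spec:
  hostname: busybox-1
  subdomain: default-subdomain
  containers:
  - name: busybox
    image: busybox:1.28
    command: ["sleep", "3600"]
```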
Note:
Because A or AAAA records are not created for Pod names, the hostname field is required
for a Pod's A or AAAA record to be created.
A Pod with no hostname but with subdomain set will only create the A or AAAA record for the
headless Service (default-subdomain.my-namespace.svc.cluster-domain.example),
pointing to the Pod's IP address.
Also, unless publishNotReadyAddresses=True is set on the Service, a Pod only gets a record
once it becomes ready.
When you set setHostnameAsFQDN: true in the Pod spec, the kubelet writes the Pod's
fully qualified domain name (FQDN) into the hostname for that Pod's namespace.
In this case, both hostname and hostname --fqdn return the Pod's FQDN.
Note:
In Linux, the hostname field of the kernel (the nodename field of struct utsname) is
limited to 64 characters.
If a Pod enables this feature and its FQDN is longer than 64 characters, the Pod will fail to start.
The Pod will remain in Pending status (shown as ContainerCreating by kubectl),
generating error events such as
"Failed to construct FQDN from pod hostname and cluster domain, FQDN
long-FQDN is too long (64 characters is the max, 70 characters requested)."
One way to improve the user experience for this scenario is to create an
admission webhook controller
that checks FQDN length when users create top-level objects such as Deployments.
Pod's DNS policy
DNS policies can be set on a per-Pod basis. Kubernetes currently supports a set of Pod-specific DNS policies,
specified in the dnsPolicy field of the Pod spec:
nameservers: a list of IP addresses that will be used as DNS servers for the Pod.
There can be at most 3 IP addresses specified. When the Pod's dnsPolicy is set to "None",
the list must contain at least one IP address; otherwise this property is optional.
The servers listed are combined with the base nameservers generated from the specified DNS policy, with duplicate addresses removed.
searches: a list of DNS search domains for hostname lookup in the Pod. This property is optional.
When specified, the provided list is merged into the base search domain names generated from the chosen DNS policy.
Duplicate domain names are removed. Kubernetes allows at most 6 search domains.
options: an optional list of objects where each object may have a name property (required) and a value property (optional).
The contents of this property are merged with the options generated from the specified DNS policy.
Duplicate entries are removed.
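A minimal sketch combining these properties; in the Kubernetes API they live under the Pod's dnsConfig field, typically alongside dnsPolicy: "None" (the concrete names, addresses, and image here are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-example          # hypothetical name for illustration
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
    - 1.2.3.4                # assumed upstream resolver
    searches:
    - ns1.svc.cluster-domain.example
    options:
    - name: ndots
      value: "2"
    - name: edns0            # value is optional for flag-style options
  containers:
  - name: test
    image: busybox:1.28
    command: ["sleep", "3600"]
```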
By default, Docker uses host-private networking, so containers can only talk to containers on the same machine.
For Docker containers to communicate across nodes, ports must be allocated on the machine's own IP address and then forwarded or proxied to the containers.
Coordinating port allocation across multiple developers or teams that provide containers is very difficult to do at scale, and exposes users to cluster-level issues outside their control.
Kubernetes assumes that Pods can communicate with other Pods, regardless of which host they land on.
Kubernetes gives every Pod its own cluster-private IP address, so you do not need to explicitly create links between Pods or map container ports to host ports.
This means that containers within a Pod can all reach each other's ports on localhost, and all Pods in a cluster can see each other without NAT.
The rest of this document elaborates on how to run reliable services on such a networking model.
We have Pods running Nginx in a flat, cluster-wide address space and can talk to those Pods directly, but what happens when a node dies?
The Pods die with it, and the Deployment will create new ones with different IPs. This is the problem a Service solves.
A Kubernetes Service is an abstraction that defines a logical set of Pods running somewhere in your cluster, all of which provide the same functionality.
When created, each Service is assigned a unique IP address (also called clusterIP).
This address is tied to the lifespan of the Service and will not change while the Service is alive.
Pods can be configured to talk to the Service, and know that communication with the Service will be automatically load-balanced to some Pod that is a member of the Service.
The spec above will create a Service that targets TCP port 80 on any Pod with the run: my-nginx label,
and expose it on an abstracted Service port (targetPort is the port the container accepts traffic on;
port is the abstracted Service port, which can be any port other Pods use to access the Service).
See the Service API object
for the list of supported fields in a Service definition.
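The Service manifest referenced above does not appear in this excerpt; a minimal sketch matching the description would be:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  labels:
    run: my-nginx
spec:
  selector:
    run: my-nginx        # selects the nginx Pods
  ports:
  - port: 80             # abstracted Service port
    protocol: TCP
    targetPort: 80       # port the container accepts traffic on
```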
Check your Service resource:
kubectl get svc my-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx ClusterIP 10.0.162.149 <none> 80/TCP 21s
As mentioned previously, a Service is backed by a group of Pods. These Pods are exposed through endpoints.
The Service's selector is evaluated continuously, and the results are POSTed to an Endpoints object also named my-nginx.
When a Pod dies, it is automatically removed from the endpoints; new Pods matching the Service's selector are automatically added to the endpoints.
Check the endpoints, and note that the IPs are the same as the Pods created in the first step.
For some parts of your application, you may want to expose a Service on an external IP address.
Kubernetes supports two ways of doing this: NodePort and LoadBalancer.
The Service created in the previous section already used NodePort, so your nginx HTTPS replica is ready
to serve traffic on the internet if it uses a public IP.
When this feature is enabled, the EndpointSlice controller is responsible for setting hints on EndpointSlices.
The controller allocates a proportional number of endpoints to each zone.
This proportion is based on the
allocatable
CPU cores of the nodes running in that zone.
For example, if one zone has 2 CPU cores and another zone has only 1 CPU core,
the controller allocates twice as many endpoints to the zone with 2 CPU cores.
In the v1 API, per-endpoint topology was effectively removed in favor of the dedicated
fields nodeName and zone.
Setting arbitrary topology information on the endpoint field of an EndpointSlice has been
deprecated and is no longer supported in the v1 API. Instead, the v1 API supports the individual
nodeName and zone fields. These fields are automatically translated between API versions.
For example, the value of the topology.kubernetes.io/zone key in the topology field of the
v1beta1 API is accessible as the zone field in the v1 API.
Kubernetes supports many types of volumes.
A Pod can use any number of volume types simultaneously.
Ephemeral volume types have the same lifetime as their Pod, but persistent volumes can outlive a Pod.
When a Pod ceases to exist, Kubernetes destroys its ephemeral volumes; however, Kubernetes does not
destroy persistent volumes. For any kind of volume in a given Pod, data is preserved across container restarts.
An emptyDir volume is created when a Pod is assigned to a node, and exists for as long as that Pod runs on that node.
As the name says, the volume is initially empty.
All containers in the Pod can read and write the same files in the emptyDir volume,
though the volume can be mounted at the same or different paths in each container.
When a Pod is removed from a node for any reason, the data in the emptyDir is deleted permanently.
Note: A container crashing does not remove a Pod from a node, so the data in an emptyDir volume is safe across container crashes.
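A minimal emptyDir sketch (the Pod name, image, and mount path are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-demo           # hypothetical name for illustration
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: cache-volume
      mountPath: /cache      # survives container crashes, not Pod deletion
  volumes:
  - name: cache-volume
    emptyDir: {}
```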
A gcePersistentDisk volume mounts a Google Compute Engine (GCE) persistent disk (PD) into your Pod.
Unlike emptyDir, which is erased when a Pod is removed, the contents of a PD are preserved
when the Pod is deleted; the volume is merely unmounted.
This means that a PD can be pre-populated with data, and that data can be shared between Pods.
An iscsi volume mounts an iSCSI (SCSI over IP) volume into your Pod.
Unlike emptyDir, which is erased when a Pod is removed, the contents of an iscsi volume
are preserved when the Pod is deleted; the volume is merely unmounted.
This means that an iscsi volume can be pre-populated with data, and that data can be shared between Pods.
Note: You must have your own iSCSI server running, with the volume created, before you can use it.
A feature of iSCSI is that it can be mounted read-only by multiple consumers simultaneously.
This means that you can pre-populate a volume with your dataset and then serve it in parallel from as many Pods as you need.
Unfortunately, iSCSI volumes can only be mounted by a single consumer in read-write mode; simultaneous writers are not allowed.
vSphere Infrastructure (VI) administrators can specify custom Virtual SAN storage capabilities during dynamic volume provisioning.
You can now define storage requirements, such as performance and availability, in the form of storage capabilities during dynamic volume provisioning.
The storage capability requirements are converted into a Virtual SAN policy, which is then pushed
down to the Virtual SAN layer when a persistent volume (virtual disk) is created.
The virtual disk is distributed across the Virtual SAN datastore to meet the requirements.
The key design idea is that the parameters for a volume claim
are allowed inside a volume source of the Pod.
Labels, annotations, and the whole set of fields for a PersistentVolumeClaim are supported.
When such a Pod is created,
the ephemeral volume controller creates an actual PersistentVolumeClaim object in the same namespace as the Pod,
and ensures that the PersistentVolumeClaim is deleted when the Pod is deleted.
That triggers volume binding and/or provisioning, either immediately if the
StorageClass
uses immediate volume binding,
or when the Pod is tentatively scheduled onto a node (WaitForFirstConsumer volume binding mode).
The latter is recommended for generic ephemeral volumes, because then the scheduler is free to choose a suitable node for the Pod.
With immediate binding, the scheduler is forced to select a node that can access the volume once it becomes available.
In terms of resource ownership,
a Pod that has generic ephemeral storage is the owner of the PersistentVolumeClaim(s) that provide that ephemeral storage.
When the Pod is deleted, the Kubernetes garbage collector deletes the PVC,
which then usually triggers deletion of the volume, because the default reclaim policy of storage classes is to delete volumes.
You can create quasi-ephemeral local storage using a StorageClass with a reclaim policy of retain:
the storage then outlives the Pod, and in this case you need to ensure that volume clean-up happens separately.
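A minimal sketch of a generic ephemeral volume as described above (the Pod name, image, and StorageClass name are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-demo         # hypothetical name for illustration
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:     # the full PVC spec is allowed here
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: "scratch-storage-class"   # assumed class
          resources:
            requests:
              storage: 1Gi
```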
A Deployment, which both creates a ReplicaSet to ensure that the desired number of Pods is always available
and specifies a strategy for replacing Pods (such as RollingUpdate),
is almost always preferable to creating Pods directly,
except for some explicit restartPolicy: Never scenarios. A Job may also be appropriate.
You can manipulate labels for debugging.
Because Kubernetes controllers (such as ReplicaSet) and Services match Pods using selector labels,
removing the relevant labels from a Pod will stop it from being considered by a controller or from being served traffic by a Service.
If you remove the labels of an existing Pod, its controller will create a new Pod to take its place.
This is a useful way to debug a previously "live" Pod in a "quarantine" environment.
To interactively remove or add labels, use kubectl label.
This command opens your default editor and allows you to update the base64-encoded Secret values in the data field:
```yaml
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  username: YWRtaW4=
  password: MWYyZDFlMmU2N2Rm
kind: Secret
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: { ... }
  creationTimestamp: 2016-01-22T18:41:56Z
  name: mysecret
  namespace: default
  resourceVersion: "164619"
  uid: cfee02d6-c137-11e5-8d73-42010af00002
type: Opaque
```
Using Secrets
Secrets can be mounted as data volumes or exposed as environment variables
for use by a container in a Pod. They can also be used by other parts of the system without being directly exposed to the Pod.
For example, they can hold credentials that other parts of the system use to interact with external systems on your behalf.
Using Secrets as files in a Pod
To consume a Secret from a volume in a Pod:
Create a Secret or use an existing one. Multiple Pods can reference the same Secret.
Modify your Pod definition to add a volume under spec.volumes[]. Name the volume anything;
its spec.volumes[].secret.secretName field must be the name of the Secret object.
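The steps above can be sketched as follows (the Pod name, image, and mount path are illustrative assumptions; mysecret is the Secret shown earlier in the text):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-demo          # hypothetical name for illustration
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: secret-volume
      mountPath: /etc/secret
      readOnly: true
  volumes:
  - name: secret-volume      # any name you like
    secret:
      secretName: mysecret   # must match the Secret object's name
```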
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON
0s 0s 1 dapi-test-pod Pod Warning InvalidEnvironmentVariableNames kubelet, 127.0.0.1 Keys [1badkey, 2alsobad] from the EnvFrom secret default/mysecret were skipped since they are considered invalid environment variable names.
Secret and Pod lifetime interaction
When a Pod is created via the API, there is no check that a referenced Secret exists.
Once a Pod is scheduled, the kubelet tries to fetch the Secret's value.
If the Secret cannot be fetched, because it does not exist or because of a temporary lack of connection to the API server,
the kubelet periodically retries, and reports an event about the Pod explaining why it has not started yet.
Once the Secret is fetched, the kubelet creates and mounts a volume containing it.
None of the Pod's containers will start until all of the Pod's volumes are mounted.
If a container specifies its own memory limit but does not specify a memory request, Kubernetes
automatically assigns a memory request that matches the limit. Similarly, if a container specifies
its own CPU limit but does not specify a CPU request, Kubernetes automatically assigns a CPU request
that matches the limit.
Resource types
CPU and memory are each a resource type, and each has a base unit.
CPU represents compute processing and is specified in units of Kubernetes CPUs.
Memory is specified in units of bytes.
If you are using Kubernetes v1.14 or newer, you can also specify huge page resources.
Huge pages are a Linux-specific feature where the node kernel allocates blocks of memory that are much larger than the default page size.
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on.
Each node has a maximum capacity for each resource type: the amount of CPU and memory it can provide to Pods.
The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node.
Note that even if the actual memory or CPU usage on a node is very low, the scheduler still
refuses to place a Pod on the node if the capacity check fails.
This protects against a resource shortage later, when resource usage increases, for example during a daily peak in request rate.
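A minimal sketch of container resource requests and limits (the Pod name, image, and values are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resources-demo       # hypothetical name for illustration
spec:
  containers:
  - name: app
    image: nginx:1.14.2
    resources:
      requests:
        cpu: 250m            # what the scheduler reserves on a node
        memory: 64Mi
      limits:
        cpu: 500m            # enforced by the container runtime
        memory: 128Mi
```

If only the limits were set here, Kubernetes would assign matching requests automatically, as described above.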
How Pods with resource limits are run
When the kubelet starts a container of a Pod, it passes the CPU and memory limits to the container runtime.
Your containers may be terminated because they are resource-starved. To check whether a container
is being killed because it is hitting a resource limit, run kubectl describe pod on the Pod in question:
kubectl describe pod simmemleak-hra99
Name: simmemleak-hra99
Namespace: default
Image(s): saadali/simmemleak
Node: kubernetes-node-tf0f/10.240.216.66
Labels: name=simmemleak
Status: Running
Reason:
Message:
IP: 10.244.2.75
Replication Controllers: simmemleak (1/1 replicas created)
Containers:
simmemleak:
Image: saadali/simmemleak
Limits:
cpu: 100m
memory: 50Mi
State: Running
Started: Tue, 07 Jul 2015 12:54:41 -0700
Last Termination State: Terminated
Exit Code: 1
Started: Fri, 07 Jul 2015 12:54:30 -0700
Finished: Fri, 07 Jul 2015 12:54:33 -0700
Ready: False
Restart Count: 5
Conditions:
Type Status
Ready False
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {scheduler } scheduled Successfully assigned simmemleak-hra99 to kubernetes-node-tf0f
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-node-tf0f} implicitly required container POD pulled Pod container image "k8s.gcr.io/pause:0.8.0" already present on machine
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-node-tf0f} implicitly required container POD created Created with docker id 6a41280f516d
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-node-tf0f} implicitly required container POD started Started with docker id 6a41280f516d
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-node-tf0f} spec.containers{simmemleak} created Created with docker id 87348f12526a
In the above example, Restart Count: 5 indicates that the simmemleak container in the Pod was terminated and restarted five times.
You can run kubectl get pod with the -o go-template=... option to fetch the status of previously terminated containers:
kubectl get pod -o go-template='{{range.status.containerStatuses}}{{"Container Name: "}}{{.name}}{{"\r\nLastState: "}}{{.lastState}}{{end}}' simmemleak-hra99
If you delete a PriorityClass, existing Pods that use the name of the deleted PriorityClass
are unaffected, but you cannot create new Pods that use that name.
Example PriorityClass
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
```
Non-preempting PriorityClass
FEATURE STATE: Kubernetes v1.15 [alpha]
Pods with preemptionPolicy: Never are placed in the scheduling queue ahead of lower-priority Pods,
but they cannot preempt other Pods.
A non-preempting Pod waiting to be scheduled stays in the scheduling queue until sufficient resources are free and it can be scheduled.
Non-preempting Pods, like other Pods, are subject to scheduler back-off.
This means that if the scheduler tries to schedule these Pods and they cannot be scheduled, they are
retried with lower frequency, allowing other Pods with lower priority to be scheduled before them.
Non-preempting Pods may still be preempted by other, higher-priority Pods.
preemptionPolicy defaults to PreemptLowerPriority, which allows Pods of that PriorityClass
to preempt lower-priority Pods (the existing default behavior).
If preemptionPolicy is set to Never, Pods in that PriorityClass are non-preempting.
An example use case is data science workloads.
A user may submit a job that they want to be prioritized above other workloads, but they do not
wish to discard existing work by preempting running Pods.
The high-priority job with preemptionPolicy: Never
is scheduled ahead of other queued Pods as soon as sufficient cluster resources "naturally" become free.
Example non-preempting PriorityClass
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
```
Pod priority
After you have one or more PriorityClasses, you can create Pods that specify one of those
PriorityClass names in their specifications. The priority admission controller uses the
priorityClassName field to populate the integer value of the priority. If the priority class is not found, the Pod is rejected.
The following YAML is an example of a Pod configuration that uses the PriorityClass created in the preceding example.
The priority admission controller checks the spec and resolves the Pod's priority to 1000000.
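The Pod manifest announced above does not appear in this excerpt; based on the high-priority PriorityClass defined earlier, a matching sketch (the Pod name, labels, and image are assumptions) would be:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx                        # hypothetical name for illustration
  labels:
    env: test
spec:
  priorityClassName: high-priority   # refers to the PriorityClass above
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
```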
When Pod priority is enabled in a cluster, the scheduler orders pending Pods by priority,
and a pending Pod is placed ahead of other pending Pods with lower priority in the scheduling queue.
As a result, a higher-priority Pod may be scheduled sooner than lower-priority Pods if its scheduling requirements are met.
If the higher-priority Pod cannot be scheduled, the scheduler continues and tries to schedule other, lower-priority Pods.
Preemption
When Pods are created, they go into a queue and wait to be scheduled. The scheduler picks a Pod from the queue and tries to schedule it on a node.
If no node satisfies all of the Pod's specified requirements, preemption logic is triggered for the pending Pod.
Call the pending Pod P. Preemption logic tries to find a node where removing one or more Pods with
lower priority than P would enable P to be scheduled on that node. If such a node is found,
one or more lower-priority Pods are evicted from it. Once the evicted Pods
are gone, P can be scheduled on the node.
User-exposed information
When Pod P preempts one or more Pods on node N, the nominatedNodeName field of Pod P's status
is set to the name of node N. This field helps the scheduler track the resources
reserved for P, and also gives users information about preemptions in their clusters.
Note that Pod P is not necessarily scheduled to its "nominated node".
When the selected Pods are preempted, they each get their graceful termination period.
If another node becomes available while the scheduler is waiting for the victim Pods to terminate, the scheduler may use the
other node to schedule Pod P. As a result, the nominatedNodeName and nodeName of the Pod are not always the same.
Also, if the scheduler preempts Pods on node N but then a Pod with higher priority than P arrives,
the scheduler may give node N to the new, higher-priority Pod. In that case, the scheduler
clears the nominatedNodeName of Pod P. By doing this, the scheduler makes Pod P eligible to
preempt Pods on another node.
Limitations of preemption
Graceful termination of preemption victims
When Pods are preempted, the victims get their
graceful termination period.
They have that much time to finish their work and exit; if they do not, they are killed.
This graceful termination period creates a time gap between the point at which the scheduler preempts Pods
and the point at which the pending Pod (P) can be scheduled on the node (N).
In the meantime, the scheduler keeps scheduling other pending Pods.
As victims exit or are terminated, the scheduler tries to schedule Pods from the pending queue.
Therefore, there is usually a time gap between the point at which the scheduler preempts victims and the time at which Pod P is scheduled.
To minimize this gap, you can set the graceful termination period of lower-priority Pods to zero
or a small number.
PodDisruptionBudget is supported, but not guaranteed
A PodDisruptionBudget (PDB)
allows application owners to limit the number of Pods of a replicated application that are down simultaneously due to voluntary disruptions.
Kubernetes respects PDBs when preempting Pods, but doing so is best effort.
The scheduler tries to find victims whose PDBs would not be violated by preemption, but if no
such victims are found, preemption still happens: lower-priority Pods are removed
despite their PDBs being violated.
Inter-Pod affinity on lower-priority Pods
A node is considered for preemption only when the answer to the following question is yes:
"If all Pods with lower priority than the pending Pod are removed from the node,
can the pending Pod be scheduled on the node?"
Note: Preemption does not necessarily remove all lower-priority Pods.
If the pending Pod can be scheduled by removing fewer than all of the lower-priority Pods,
then only a portion of them is removed.
Even so, the answer to the question above must be yes. If it is no, Kubernetes
does not consider the node for preemption.
If a pending Pod has inter-Pod affinity with one or more of the lower-priority Pods on a node,
that affinity rule cannot be satisfied if those lower-priority Pods are evicted.
In this case, the scheduler does not preempt any Pods on the node; instead, it looks for another node.
The scheduler may or may not find a suitable node.
There is no guarantee that the pending Pod will eventually be scheduled.
One solution to this problem is to create inter-Pod affinity only towards Pods of equal or higher priority.
Cross-node preemption
Suppose node N is being considered for preemption so that a pending Pod P can be scheduled on N,
but P might become feasible on N only if a Pod on another node is preempted. For example:
Pod P is being considered for node N.
Pod Q is running on another node in the same zone as node N.
Pod P has zone-wide anti-affinity with Pod Q
(topologyKey: topology.kubernetes.io/zone).
There are no other cases of anti-affinity between Pod P and other Pods in the zone.
In order to schedule Pod P on node N, Pod Q could be preempted, but the scheduler does not perform
cross-node preemption. So Pod P is deemed unschedulable on node N.
If Pod Q were removed from its node, the inter-Pod anti-affinity rule would be satisfied, and Pod P
could possibly be scheduled on node N.
Pod priority and preemption can have unwanted side effects.
Here are some potential problems and ways to deal with them.
Pods are preempted unnecessarily
Preemption removes existing Pods from a cluster under resource pressure to make room for higher-priority pending Pods.
If you give high priorities to certain Pods by mistake, these unintentionally high-priority Pods may
cause preemption in your cluster. Pod priority is specified by setting the priorityClassName field in the Pod's spec.
The integer value of the priority is then resolved and populated into the priority field of the Pod spec.
To address this problem, you can change the priorityClassName of those Pods to use
lower-priority classes, or leave the field empty. An empty priorityClassName resolves to priority zero by default.
When a Pod is preempted, events are recorded for the preempted Pod.
Preemption should happen only when a cluster does not have enough resources for a Pod.
In such cases, preemption happens only when the priority of the pending Pod (the preemptor) is
higher than that of the victim Pods.
Preemption must not happen when there is no pending Pod, or when the pending Pods have priority equal to or lower than the victims.
If preemption happens in such scenarios, please file an issue.
Pods are preempted, but the preemptor is not scheduled
When Pods are preempted, they receive their graceful termination period (30 seconds by default).
If the victim Pods do not terminate within this period, they are forcibly terminated.
Once all the victims are gone, the preemptor Pod can be scheduled.
While the preemptor Pod is waiting for the victims to go away, a higher-priority Pod may be created
that fits on the same node. In this case, the scheduler schedules the higher-priority Pod instead of the preemptor.
This is expected behavior: the Pod with the higher priority should take the place of a Pod with a lower priority.
Higher-priority Pods are preempted before lower-priority Pods
The scheduler tries to find nodes that can run a pending Pod. If no node is found, the scheduler tries
to evict lower-priority Pods from some node to make room for the pending Pod.
If a node with lower-priority Pods is not feasible for running the pending Pod, the scheduler may choose
another node with higher-priority Pods (relative to the Pods on the previously evaluated node) for preemption.
Even so, the victims must still have lower priority than the preemptor Pod.
When there are multiple nodes available for preemption, the scheduler tries to choose the node whose
set of Pods has the lowest priority. However, if those Pods have a PodDisruptionBudget (PDB) that would
be violated by preempting them, the scheduler may choose another node with somewhat higher-priority Pods.
When multiple nodes exist for preemption and none of the above scenarios apply, the scheduler chooses the node with the lowest priority.
Interactions between Pod priority and quality of service
Pod priority and QoS class are two
orthogonal features with few interactions; there are no default restrictions on setting a Pod's
priority based on its QoS class.
The scheduler's preemption logic does not consider QoS when choosing preemption targets.
Preemption considers Pod priority and chooses the set of targets with the lowest priority.
Higher-priority Pods are considered for preemption only if removing the lowest-priority Pods is not
sufficient to allow the scheduler to schedule the preemptor Pod, or if the lowest-priority Pods are
protected by a PodDisruptionBudget.
The only component that considers both QoS and Pod priority is the kubelet, in its
out-of-resource eviction.
The kubelet first ranks Pods for eviction by whether their usage of the starved resource exceeds their requests,
then by priority, and then by their consumption of the starved resource relative to their scheduling requests.
See Evicting end-user Pods for more details.
Kubelet out-of-resource eviction does not evict Pods whose usage does not exceed their requests.
If a lower-priority Pod is not exceeding its requests, it will not be evicted; another,
higher-priority Pod that exceeds its requests may be evicted instead.
Pod security policies are cluster-level resources that control security-sensitive aspects of the Pod specification.
PodSecurityPolicy
objects define a set of conditions that a Pod must satisfy in order to be accepted into the system,
as well as defaults for the related fields.
Pod security policies allow an administrator to control the following aspects:
Pod security policy control is implemented as an optional (but recommended)
admission controller.
PodSecurityPolicies are enforced by enabling the admission controller,
but enabling it without authorizing any policies
will prevent any Pods from being created in the cluster.
Since the pod security policy API (policy/v1beta1/podsecuritypolicy) is enabled independently of
the admission controller, for existing clusters it is recommended that policies are added and
authorized before the admission controller is enabled.
Authorizing policies
When a PodSecurityPolicy resource is created, it does nothing. In order to use it, the
requesting user or the target Pod's
service account
must be authorized to use the policy, by being allowed the use verb on the policy.
Most Kubernetes Pods are not created directly by users. Instead, they are typically created indirectly
as part of a Deployment,
a ReplicaSet,
or another templated controller, via the controller manager.
Granting a controller access to the policy would grant access for all Pods created by that controller,
so the preferred method for authorizing policies is to grant access to the Pod's service account
(see the example).
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: example
spec:
  privileged: false  # Don't allow privileged pods!
  # The rest fills in some required fields.
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'
Error from server (Forbidden): error when creating "STDIN": pods "privileged" is forbidden: unable to validate against any pod security policy: [spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
1m 2m 15 pause-7774d79b5 ReplicaSet Warning FailedCreate replicaset-controller Error creating: pods "pause-7774d79b5-" is forbidden: no providers available to validate pod request
What happened? We already bound the psp:unprivileged role for the user fake-user,
so why do we get the error Error creating: pods "pause-7774d79b5-" is forbidden: no providers available to validate pod request?
The answer lies in the source: replicaset-controller.
fake-user successfully created the Deployment, which in turn successfully created a ReplicaSet,
but when the ReplicaSet went to create the Pod, it was not authorized to use the example PodSecurityPolicy resource.
To fix this, bind the psp:unprivileged role to the Pod's service account instead.
In this case, since we didn't give a service account name, the service account is default.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
    apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
  privileged: false
  # Required to prevent escalations to root.
  allowPrivilegeEscalation: false
  # This is redundant with non-root + disallow privilege escalation,
  # but we can provide it for defense in depth.
  requiredDropCapabilities:
  - ALL
  # Allow core volume types.
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  # Assume that persistentVolumes set up by the cluster admin are safe to use.
  - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    # Require the container to run without root privileges.
    rule: 'MustRunAsNonRoot'
  seLinux:
    # This policy assumes the nodes are using AppArmor rather than SELinux.
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  readOnlyRootFilesystem: false
Pod-level and node-level PID limits set a hard limit.
Once the limit is hit, a workload will start experiencing failures when trying to obtain a new PID.
This may or may not lead to the Pod being rescheduled, depending on how the workload reacts to these failures
and on how the Pod's liveness and readiness probes are configured.
However, if the limits are set correctly, you can guarantee that other Pods' workloads and system processes
will not run out of PIDs when one Pod misbehaves.
You can constrain a Pod so that it can only run on particular
nodes.
There are several ways to do this, and the recommended approaches all use
label selectors to make the selection.
Generally such constraints are unnecessary, as the scheduler will automatically do a reasonable placement
(for example, spreading Pods across nodes rather than placing them on nodes with insufficient free resources). But in some circumstances you may want more control over
which node a Pod lands on, for example to ensure that a Pod ends up on a machine with an SSD attached,
or to co-locate Pods from two different services that communicate heavily into the same availability zone.
nodeSelector
nodeSelector is the simplest recommended form of node selection constraint. nodeSelector is a field of PodSpec.
It specifies a map of key-value pairs. For the Pod to be eligible to run on a node, the node must have
each of the indicated key-value pairs as labels (it can have additional labels as well).
The most common usage is one key-value pair.
nodeSelector provides a very simple way to constrain Pods to nodes with particular labels.
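As a minimal sketch of such a constraint (the Pod name, image, and the disktype: ssd label are illustrative, not mandated by the text above), a Pod that should only land on nodes carrying a matching label could look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    disktype: ssd   # only nodes labeled disktype=ssd are eligible
```

A node would acquire the matching label via, for example, kubectl label nodes <node-name> disktype=ssd.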
The affinity/anti-affinity feature greatly expands the types of constraints you can express. The key enhancements are:
the language is more expressive (not just "AND of exact match");
you can indicate that a rule is a "soft requirement"/"preference" rather than a hard requirement, so that
if the scheduler cannot satisfy it, the Pod is still scheduled;
you can constrain against labels on Pods running on a node (or other topology domain) rather than against
labels on the node itself, which controls which Pods can and cannot be co-located.
The affinity feature consists of two types of affinity: "node affinity" and "inter-pod affinity/anti-affinity".
Node affinity is like the existing nodeSelector (but with the first two benefits listed above), while
inter-pod affinity/anti-affinity constrains against Pod labels rather than node labels (as described in the third item above,
in addition to having the first and second properties listed above).
Node affinity
Node affinity is conceptually similar to nodeSelector: it allows you to constrain which nodes a
Pod is eligible to be scheduled on, based on labels on the node.
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and
preferredDuringSchedulingIgnoredDuringExecution.
You can think of them as "hard" and "soft" requirements respectively, in the sense that the former specifies rules
that must be met for a Pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax),
while the latter specifies preferences that the scheduler will try to enforce but cannot guarantee.
The "IgnoredDuringExecution" part of the names means that, similar to how nodeSelector works,
if labels on a node change at runtime such that the affinity rules on a Pod are no longer met, the Pod
will still continue to run on that node.
In the future we plan to offer requiredDuringSchedulingRequiredDuringExecution,
which will be identical to requiredDuringSchedulingIgnoredDuringExecution,
except that it will evict Pods from nodes that cease to satisfy the Pods' node affinity requirements.
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be
"only run the Pod on nodes with Intel CPUs", and an example of
preferredDuringSchedulingIgnoredDuringExecution would be
"try to run this set of Pods in failure zone XYZ, but if that is not possible, then allow some
Pods to run elsewhere".
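A sketch of the "hard" variant (the Pod name, image, and disktype label are illustrative): the Pod below may only be scheduled onto nodes whose disktype label is ssd.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      # hard requirement: the node must match one of the nodeSelectorTerms
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: nginx
    image: nginx
```

A preferredDuringSchedulingIgnoredDuringExecution rule would instead take a list of weighted preferences that the scheduler tries, but is not guaranteed, to honor.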
Inter-pod affinity and anti-affinity allow you to constrain which nodes a Pod is eligible to be scheduled on
based on labels on Pods that are already running on the node, rather than based on labels on the node itself.
The rules are of the form "this Pod should (or, in the case of anti-affinity, should not) run on node X
if node X is already running one or more Pods that satisfy rule Y".
Y is expressed as a LabelSelector with an optional associated list of namespaces;
unlike nodes, because Pods are namespaced (and therefore the labels on Pods are implicitly namespaced),
a label selector over Pod labels must specify which namespaces the selector should apply to.
Conceptually, X is a topology domain such as a node, rack, cloud provider zone, or cloud provider region.
You express it using a topologyKey, which is the key of a node label that the system
uses to denote such a topology domain.
See the label keys listed in the section Interlude: built-in node labels above.
Note:
Inter-pod affinity and anti-affinity require substantial amounts of processing, which can slow down scheduling in large clusters significantly.
We do not recommend using them in clusters larger than several hundred nodes.
Note:
Pod anti-affinity requires nodes to be consistently labelled; in other words, every node in the cluster must have an appropriate label matching
topologyKey. If some or all nodes are missing the specified topologyKey label, it can lead to unintended behavior.
As with node affinity, there are currently two types of Pod affinity and anti-affinity, called
requiredDuringSchedulingIgnoredDuringExecution and
preferredDuringSchedulingIgnoredDuringExecution, which denote "hard" vs. "soft" requirements.
See the description in the node affinity section earlier.
An example of requiredDuringSchedulingIgnoredDuringExecution affinity would be
"co-locate the Pods of service A and service B in the same zone, since they communicate a lot with each other", and an example of
preferredDuringSchedulingIgnoredDuringExecution anti-affinity would be
"spread the Pods from this service across zones" (a hard requirement would not make sense, since you
probably have more Pods than zones).
Inter-pod affinity is specified via the podAffinity field under the affinity field in the PodSpec,
and inter-pod anti-affinity is specified via the podAntiAffinity field under the affinity field in the PodSpec.
The affinity configuration on this Pod defines one Pod affinity rule and one Pod anti-affinity rule.
In this example, the podAffinity is requiredDuringSchedulingIgnoredDuringExecution,
while the podAntiAffinity is preferredDuringSchedulingIgnoredDuringExecution.
The Pod affinity rule says that the Pod can be scheduled onto a node only if that node is in the same zone
as at least one already-running Pod that has a label with key "security" and value "S1".
(More precisely, the Pod is eligible to run on node N if node N has a label with key topology.kubernetes.io/zone and some value V,
such that there is at least one node in the cluster with key
topology.kubernetes.io/zone and value V that is running a Pod that has a label with key "security" and value
"S1".)
The Pod anti-affinity rule says that the Pod should not be scheduled onto a node if that node is in the same zone as a Pod with a label having key "security" and value "S2".
(If the topologyKey is topology.kubernetes.io/zone, this means that the Pod cannot be scheduled onto a node
if that node is in the same zone as a Pod with a label with key "security" and value "S2".)
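The manifest this description refers to is not reproduced in this extract; a sketch consistent with the description (the Pod name, image, and weight value are illustrative) would be:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      # hard rule: co-locate with a pod labeled security=S1 in the same zone
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      # soft rule: avoid zones already running a pod labeled security=S2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: nginx
```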
See the design doc
for many more examples of Pod affinity and anti-affinity, in both the
requiredDuringSchedulingIgnoredDuringExecution
and preferredDuringSchedulingIgnoredDuringExecution flavors.
You can put multiple taints on a single node and multiple tolerations on a single Pod.
The way Kubernetes processes multiple taints and tolerations is like a filter: start with all of a node's taints,
then ignore those for which the Pod has a matching toleration. The effect values of the remaining un-ignored taints determine
whether the Pod can be assigned to that node. In particular:
if there is at least one un-ignored taint with effect NoSchedule,
then Kubernetes will not schedule the Pod onto that node.
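For instance (the key1/value1 pair is illustrative), a node tainted with kubectl taint nodes node1 key1=value1:NoSchedule only admits Pods whose spec carries a matching toleration:

```yaml
tolerations:
- key: "key1"
  operator: "Equal"    # the toleration matches the taint's key, value, and effect
  value: "value1"
  effect: "NoSchedule"
```

With this toleration in place, the NoSchedule taint above is filtered out when the scheduler evaluates node1 for the Pod.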
Pods with PreemptionPolicy: Never will be placed in the scheduling queue ahead of lower-priority Pods,
but they cannot preempt other Pods. A non-preempting Pod waiting to be scheduled stays in the scheduling queue
until sufficient resources become free and it can be scheduled. Non-preempting Pods, like other Pods, are subject to scheduler back-off.
This means that if the scheduler tries these Pods and they cannot be scheduled, they will be retried with lower frequency,
allowing other Pods with lower priority to be scheduled before them.
Non-preempting Pods may still be preempted by other, high-priority Pods.
PreemptionPolicy defaults to PreemptLowerPriority,
which allows Pods of that PriorityClass to preempt lower-priority Pods (as is the existing default behavior).
If PreemptionPolicy is set to Never, Pods in that PriorityClass will be non-preempting.
An example use case is for data science workloads. A user may submit a job that they want to be prioritized above other workloads,
but do not wish to discard existing work by preempting running Pods.
A high-priority job with PreemptionPolicy: Never will be scheduled ahead of other queued Pods,
as soon as sufficient cluster resources "naturally" become free.
Example of a non-preempting PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
When Pod priority is enabled, the scheduler orders pending Pods by their priority,
and each pending Pod is placed ahead of other pending Pods with lower priority in the scheduling queue.
As a result, a higher-priority Pod may be scheduled sooner than Pods with lower priority, if its scheduling requirements are met.
If such a Pod cannot be scheduled, the scheduler will continue and try to schedule other, lower-priority Pods.
Preemption
When Pods are created, they go into a queue and wait to be scheduled.
The scheduler picks a Pod from the queue and tries to schedule it on a node.
If no node is found that satisfies all the specified requirements of the Pod, preemption logic is triggered for the pending Pod.
Let's call the pending Pod P. Preemption logic tries to find a node
where the removal of one or more Pods with priority lower than P would make it possible to schedule P on that node.
If such a node is found, one or more of the lower-priority Pods are evicted from the node.
After the evicted Pods are gone, P can be scheduled on the node.
User-exposed information
When Pod P preempts one or more Pods on node N,
the nominatedNodeName field of Pod P's status is set to the name of node N.
This field helps the scheduler track the resources reserved for Pod P, and gives users information about preemptions in their cluster.
Please note that Pod P is not necessarily scheduled to the "nominated node".
When Pods are preempted as victims, they get their graceful termination period.
If another node becomes available while the scheduler is waiting for the victim Pods to terminate,
the scheduler uses the other node to schedule Pod P.
As a result, nominatedNodeName and nodeName in the Pod spec are not always the same.
Also, if the scheduler preempts Pods on node N, but then a Pod with higher priority than Pod P arrives,
the scheduler may give node N to the new, higher-priority Pod.
In that case, the scheduler clears the nominatedNodeName of Pod P.
By doing this, the scheduler makes Pod P eligible to preempt Pods on another node.
Limitations of preemption
Graceful termination of preemption victims
When Pods are preempted, the victims get their
graceful termination period.
They have that much time to finish their work and exit. If they don't, they are killed.
This graceful termination period creates a time gap between the point that the scheduler preempts Pods and the time when the pending Pod (P)
can be scheduled on the node (N).
In the meantime, the scheduler keeps scheduling other pending Pods. As victims exit or get terminated,
the scheduler tries to schedule Pods from the pending queue.
Therefore, there is usually a time gap between the point that the scheduler preempts victims and the time that Pod P is scheduled.
In order to minimize this gap, you can set the graceful termination period of lower-priority Pods to zero or a small number.
PodDisruptionBudget is supported, but not guaranteed
A PodDisruptionBudget
(PDB) allows owners of replicated applications to limit the number of Pods that are down simultaneously from voluntary disruptions.
Kubernetes supports PDBs when preempting Pods, but respecting PDBs is best-effort.
The scheduler tries to find victims whose PDB would not be violated by preemption, but if no such victims are found,
preemption still happens, and lower-priority Pods are removed despite their PDBs being violated.
Inter-Pod affinity on lower-priority Pods
A node is considered for preemption only when the answer to this question is positive:
"If all the Pods with priority lower than the pending Pod were removed from the node, could the pending Pod be scheduled on the node?"
Note: Preemption does not necessarily remove all lower-priority Pods.
If the pending Pod can be scheduled by removing fewer than all of the lower-priority Pods,
then only a portion of the lower-priority Pods is removed.
Even so, the answer to the question above must be positive.
If the answer is negative, the node is not considered for preemption.
If a pending Pod has inter-pod affinity to one or more lower-priority Pods on the node,
the inter-pod affinity rule cannot be satisfied in the absence of those lower-priority Pods.
In this case, the scheduler does not preempt any Pods on the node.
Instead, it looks for another node. The scheduler may or may not find a suitable node.
There is no guarantee that the pending Pod can be scheduled.
Our recommended solution for this problem is to create inter-pod affinity only towards Pods of equal or higher priority.
Cross-node preemption
Suppose a node N is being considered for preemption so that a pending Pod P can be scheduled on N.
P might become feasible on N only if a Pod on another node is preempted.
Here is an example:
Pod P is being considered for node N.
Pod Q is running on another node in the same zone as node N.
Pod P has zone-wide anti-affinity with Pod Q (topologyKey: topology.kubernetes.io/zone).
There are no other cases of anti-affinity between Pod P and other Pods in the zone.
In order to schedule Pod P on node N, Pod Q could be preempted, but the scheduler does not perform cross-node preemption.
So, Pod P will be deemed unschedulable on node N.
If Pod Q were removed from its node, the inter-pod anti-affinity constraint would no longer be violated,
and Pod P could possibly be scheduled on node N.
We may consider adding cross-node preemption in future versions if there is enough demand
and if we find an algorithm with reasonable performance.
Troubleshooting
Pod priority and preemption can have unwanted side effects. Here are some examples of potential problems and ways to deal with them.
Pods are preempted unnecessarily
Preemption removes existing Pods from a cluster under resource pressure to make room for higher-priority pending Pods.
If you give high priority to certain Pods by mistake, these unintentionally high-priority Pods may cause preemption in your cluster.
Pod priority is specified by setting the priorityClassName field in the Pod's specification.
The integer value of the priority is then resolved and populated into the priority field of the podSpec.
To address the problem, you can change the priorityClassName of those Pods to use a lower-priority class,
or leave the field empty. An empty priorityClassName resolves to zero by default.
When a Pod is preempted, the cluster records events for the preempted Pod. Preemption should happen only when the cluster does not have enough resources for a Pod.
In such cases, preemption happens only when the priority of the pending Pod (the preemptor) is higher than that of the victim Pods.
Preemption must not happen when there is no pending Pod, or when the pending Pod's priority is equal to or lower than that of the victims.
If preemption happens in such scenarios, please file an issue.
Pods are preempted, but the preemptor is not scheduled
When Pods are preempted, they receive their requested graceful termination period, which defaults to 30 seconds.
If the victim Pods do not terminate within that period, they are forcibly terminated.
Once all the victims are gone, the preemptor Pod can be scheduled.
While the preemptor Pod is waiting for the victims to go away, a higher-priority Pod that fits on the same node may be created.
In this case, the scheduler schedules the higher-priority Pod instead of the preemptor.
This is the expected behavior: a Pod with higher priority should take the place of a Pod with lower priority.
Higher-priority Pods are preempted before lower-priority Pods
The scheduler tries to find nodes that can run a pending Pod. If no node is found,
the scheduler tries to remove lower-priority Pods from an arbitrary node in order to make room for the pending Pod.
If a node with low-priority Pods is not feasible for running the pending Pod,
the scheduler may choose another node with higher-priority Pods (compared to the Pods on the other node) for preemption.
The victims must still have lower priority than the preemptor Pod.
When multiple nodes are available for preemption, the scheduler tries to choose the node with the set of Pods having the lowest priority.
However, if such Pods have a PodDisruptionBudget that would be violated if they were preempted,
then the scheduler may choose another node with higher-priority Pods.
When multiple nodes exist for preemption and none of the above scenarios apply, the scheduler chooses a node with the lowest priority.
Interactions between Pod priority and quality of service
Pod priority and QoS class
are two orthogonal features with few interactions, and there are no default restrictions on setting a Pod's priority based on its QoS class.
The scheduler's preemption logic does not consider QoS when choosing preemption targets.
Preemption considers Pod priority and attempts to choose a set of targets with the lowest priority.
Higher-priority Pods are considered for preemption only if the removal of the lowest-priority Pods is not sufficient to allow the scheduler to schedule the preemptor Pod,
or if the lowest-priority Pods are protected by a PodDisruptionBudget.
The kubelet uses priority to determine the order of Pods for
eviction under resource pressure.
You can use the QoS class to estimate the order in which Pods are most likely to get evicted. The kubelet ranks Pods for eviction based on the following factors:
BestEffort or Burstable Pods whose resource usage exceeds their requests are considered first.
These Pods are evicted based on their priority and then by how far their usage exceeds their requests.
Guaranteed Pods and Burstable Pods whose usage is below their requests are evicted last, based on their priority.
Note:
The kubelet does not use a Pod's QoS class to determine the eviction order.
You can use the QoS class to estimate the most likely Pod eviction order when reclaiming resources such as memory.
QoS does not apply to ephemeral-storage requests,
so the scenario above does not apply if the node is under DiskPressure.
Guaranteed Pods are guaranteed not to be evicted only when requests and limits are specified for all the containers and they are equal.
Such Pods will never be evicted because of another Pod's resource consumption.
If a system daemon (such as kubelet, docker, or journald)
is consuming more resources than were reserved via the system-reserved or kube-reserved allocations,
and the node has only Guaranteed or Burstable Pods using less resources than the requests remaining on it,
then the kubelet must choose to evict one of these Pods in order to preserve node stability and to limit the impact of resource starvation on other Pods.
In this case, it will choose to evict the lowest-priority Pods first.
The kubelet proactively monitors for and prevents
total starvation of compute resources. In those cases, the kubelet can reclaim the starved resource by proactively failing one or more Pods.
When the kubelet fails a Pod, it terminates all of the Pod's containers, and the Pod's Phase
becomes Failed.
If the evicted Pod is managed by a Deployment, the Deployment creates another Pod for
Kubernetes to schedule.
namespace/development created
namespace/staging created
configmap/my-config created
deployment.apps/my-deployment created
persistentvolumeclaim/my-pvc created
NAME READY STATUS RESTARTS AGE TIER
my-nginx-2035384211-j5fhi 1/1 Running 0 23m fe
my-nginx-2035384211-u2c7e 1/1 Running 0 23m fe
my-nginx-2035384211-u3t6x 1/1 Running 0 23m fe
kubectl get deployment my-nginx -o yaml > /tmp/nginx.yaml
vi /tmp/nginx.yaml
# do some edit, and then save the file
kubectl apply -f /tmp/nginx.yaml
deployment.apps/my-nginx configured
rm /tmp/nginx.yaml
Not only is this model less complex overall, but it is principally compatible with the desire for Kubernetes to enable low-friction porting of apps from VMs to containers.
If your job previously ran in a VM, your VM had an IP
and could talk to the other VMs in your project. This is the same basic model.
Kubernetes IP addresses exist at the Pod scope - containers within a Pod share their network namespaces - including their IP address and MAC address.
This means that containers within a Pod can all reach each other's ports on localhost.
This also means that containers within a Pod must coordinate port usage, but this is no different from processes in a VM.
This is called the "IP-per-pod" model.
How this is implemented is a detail of the particular container runtime in use.
It is also possible to request ports on the node itself which forward to your Pod (called host ports),
but this is a very niche operation. How that forwarding is implemented is also a detail of the container runtime.
The Pod itself is blind to the existence or non-existence of host ports.
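As an illustration of the niche host-port case (the Pod name, image, and port numbers are arbitrary), a hostPort can be declared alongside a container port:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostport-example
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080   # traffic to <node-ip>:8080 is forwarded to the container's port 80
```

Because the node's port is consumed, at most one replica with this hostPort can run per node; this is one reason host ports are discouraged for general use.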
The Antrea project is an open-source networking solution intended to be a
Kubernetes-native networking solution. It leverages Open vSwitch as the networking data plane.
Open vSwitch is a high-performance programmable virtual switch that supports both Linux and Windows.
Open vSwitch enables Antrea to implement Kubernetes Network Policies in a high-performance and efficient manner.
Thanks to the programmable characteristics of Open vSwitch, Antrea is able to implement an extensive set of networking and security features and services on top of it.
AOS from Apstra
AOS is an intent-based networking system
that creates and manages complex data center environments from a simple integrated platform.
AOS leverages a highly scalable distributed design to eliminate network outages while minimizing costs.
The AOS Reference Design currently supports Layer-3 connected hosts that eliminate legacy Layer-2 switching problems.
These Layer-3 hosts can be Linux (Debian, Ubuntu, CentOS) systems
that create BGP neighbor relationships directly with the top-of-rack (TOR) switches.
AOS automates the routing adjacencies and then provides fine-grained control over the route health injections (RHI) that are common in a Kubernetes deployment.
AOS has a rich set of REST API endpoints that enable Kubernetes to quickly change network policy based on application requirements.
Further enhancements integrate the AOS Graph model used for network design with workload provisioning,
enabling an end-to-end management system for both private and public clouds.
AOS supports the use of common vendor equipment from manufacturers including Cisco, Arista, Dell, Mellanox, and HPE,
as well as a large number of white-box systems and open network operating systems like Microsoft SONiC, Dell OPX, and Cumulus Linux.
Details on how the AOS system works can be found here: https://www.apstra.com/products/how-it-works/
With this CNI plugin, Kubernetes Pods get the same IP address inside the Pod as they have on the VPC network.
The CNI allocates AWS Elastic Network Interfaces (ENIs) to each Kubernetes node and uses the secondary IP range of each ENI for Pods on that node.
The CNI includes controls for pre-allocation of ENIs and IP addresses for faster Pod startup times, and supports large clusters of up to 2,000 nodes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluentd.conf: |
    <source>
type tail
format none
path /var/log/1.log
pos_file /var/log/1.log.pos
tag count.format1
</source>
<source>
type tail
format none
path /var/log/2.log
pos_file /var/log/2.log.pos
tag count.format2
</source>
<match **>
type google_cloud
</match>
kubectl get flowschemas -o custom-columns="uid:{metadata.uid},name:{metadata.name}"
kubectl get prioritylevelconfigurations -o custom-columns="uid:{metadata.uid},name:{metadata.name}"
The aggregation layer runs in-process with the kube-apiserver. Until an extension resource is registered, the aggregation layer does nothing.
To register an API, a user must add an APIService object, which "claims" a URL path in the Kubernetes API.
From then on, the aggregation layer forwards everything sent to that API path (e.g. /apis/myextension.mycompany.io/v1/…)
to the registered APIService.
The most common way to implement an APIService is to run an extension API server in a Pod in the cluster.
If you are using the extension API server to manage resources in your cluster, the extension API server (also written as "extension-apiserver")
is typically paired with one or more controllers.
The apiserver-builder library provides a skeleton for both extension API servers and their controllers.
Response latency
Extension API servers need a low-latency network connection to and from the kube-apiserver.
Discovery requests are required to round-trip to the kube-apiserver in five seconds or less.
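A sketch of such a registration (the group name myextension.mycompany.io is carried over from the text above; the Service name, namespace, priority values, and the caBundle placeholder are illustrative):

```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1.myextension.mycompany.io   # claims /apis/myextension.mycompany.io/v1/...
spec:
  group: myextension.mycompany.io
  version: v1
  groupPriorityMinimum: 1000
  versionPriority: 15
  service:
    # the in-cluster Service fronting the extension API server Pod(s)
    name: my-extension-apiserver
    namespace: my-namespace
  caBundle: "<base64-encoded CA certificate used to verify the extension server>"
```

Once this object exists, the aggregation layer proxies requests under the claimed path to the named Service.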
Kubenet creates a Linux bridge named cbr0 and creates a veth pair for each Pod,
with the host end of each pair connected to cbr0.
The Pod end of the pair is assigned an IP address from the range assigned to the node.
The node's IP address range is set through configuration or by the controller manager.
cbr0 is assigned an MTU matching the smallest MTU of the enabled normal interfaces on the host.
A Service Broker, as defined by the Open Service Broker API specification, is an endpoint for a set of managed services offered and maintained by a third party, which could be a cloud provider such as AWS, GCP, or Azure.
Some examples of managed services are Microsoft Azure Cloud Queue, Amazon Simple Queue Service, and Google Cloud Pub/Sub, but they can be any software offering that an application can use.
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Dec 30, 2020 23:36 UTC 364d no
apiserver Dec 30, 2020 23:36 UTC 364d ca no
apiserver-etcd-client Dec 30, 2020 23:36 UTC 364d etcd-ca no
apiserver-kubelet-client Dec 30, 2020 23:36 UTC 364d ca no
controller-manager.conf Dec 30, 2020 23:36 UTC 364d no
etcd-healthcheck-client Dec 30, 2020 23:36 UTC 364d etcd-ca no
etcd-peer Dec 30, 2020 23:36 UTC 364d etcd-ca no
etcd-server Dec 30, 2020 23:36 UTC 364d etcd-ca no
front-proxy-client Dec 30, 2020 23:36 UTC 364d front-proxy-ca no
scheduler.conf Dec 30, 2020 23:36 UTC 364d no
CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Dec 28, 2029 23:36 UTC 9y no
etcd-ca Dec 28, 2029 23:36 UTC 9y no
front-proxy-ca Dec 28, 2029 23:36 UTC 9y no
# replace x with the patch version you picked for this upgrade
sudo kubeadm upgrade apply v1.25.x
Once the command finishes, you should see:
[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.25.x". Enjoy!
[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
This page shows how to configure default CPU requests and limits for a namespace.
A Kubernetes cluster can be divided into namespaces. If a container is created in a namespace that has a default CPU limit,
and the container does not specify its own CPU limit, then the container is assigned the default CPU limit.
Kubernetes also assigns a default CPU request under certain conditions, which are explained later in this topic.
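A sketch of a LimitRange that supplies such namespace defaults (the object name and the CPU amounts are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:           # default CPU limit for containers that specify none
      cpu: "1"
    defaultRequest:    # default CPU request for containers that specify none
      cpu: "0.5"
    type: Container
```

Applied with kubectl apply -f <file> -n <namespace>, this would make any container created in that namespace without its own CPU settings receive a 0.5 CPU request and a 1 CPU limit.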
Error from server (Forbidden): error when creating "examples/admin/resource/memory-constraints-pod-2.yaml":
pods "constraints-mem-demo-2" is forbidden: maximum memory usage per Container is 1Gi, but limit is 1536Mi.
Attempt to create a Pod that does not meet the minimum memory request
Here is the configuration file for a Pod with one container. The container specifies a memory request of 100 MiB and a memory limit of 800 MiB.
Error from server (Forbidden): error when creating "examples/admin/resource/memory-constraints-pod-3.yaml":
pods "constraints-mem-demo-3" is forbidden: minimum memory usage per Container is 500Mi, but request is 100Mi.
Error from server (Forbidden): error when creating "examples/admin/resource/cpu-constraints-pod-2.yaml":
pods "constraints-cpu-demo-2" is forbidden: maximum cpu usage per Container is 800m, but limit is 1500m.
Attempt to create a Pod that does not meet the minimum CPU request
Here is the configuration file for a Pod with one container. The container specifies a CPU request of 100 millicpu and a CPU limit of 800 millicpu.
Error from server (Forbidden): error when creating "examples/admin/resource/cpu-constraints-pod-3.yaml":
pods "constraints-cpu-demo-4" is forbidden: minimum cpu usage per Container is 200m, but request is 100m.
The second Pod does not get created. The output shows that creating the second Pod would cause the total memory requests to exceed the memory request quota.
Error from server (Forbidden): error when creating "examples/admin/resource/quota-mem-cpu-pod-2.yaml":
pods "quota-mem-cpu-demo-2" is forbidden: exceeded quota: mem-cpu-demo,
requested: requests.memory=700Mi,used: requests.memory=600Mi, limited: requests.memory=1Gi
curl -LO https://storage.googleapis.com/kubernetes-release/easy-rsa/easy-rsa.tar.gz
tar xzf easy-rsa.tar.gz
cd easy-rsa-master/easyrsa3
./easyrsa init-pki
configmap/cilium-config created
serviceaccount/cilium created
serviceaccount/cilium-operator created
clusterrole.rbac.authorization.k8s.io/cilium created
clusterrole.rbac.authorization.k8s.io/cilium-operator created
clusterrolebinding.rbac.authorization.k8s.io/cilium created
clusterrolebinding.rbac.authorization.k8s.io/cilium-operator created
daemonset.apps/cilium created
deployment.apps/cilium-operator created
The rest of the Getting Started Guide uses a demo application to explain how to enforce both L3/L4 (i.e., IP address + port) security policies
as well as L7 (e.g., HTTP) security policies.
The ip-masq-agent configures iptables rules to masquerade the node's or Pod's IP address
when sending traffic to destinations outside the cluster node's IP and the cluster IP range. This essentially hides Pod IP addresses behind the cluster node's IP address.
In some environments, traffic to "external" addresses must come from a known machine address.
For example, in Google Cloud, any traffic to the internet must come from a VM's IP.
When containers are used, as in Google Kubernetes Engine, egress traffic from a Pod IP will be rejected.
To avoid this, we must hide the Pod IPs behind the VM's own IP address - generally known as "masquerading".
By default, the agent is configured to treat the three private IP ranges specified by
RFC 1918
as non-masquerade
CIDRs.
These ranges are 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16.
By default, the agent also treats link-local addresses (169.254.0.0/16) as a non-masquerade CIDR.
The agent is configured to reload its configuration from /etc/config/ip-masq-agent every 60 seconds,
which is also configurable.
Traffic to the 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 ranges will not be masqueraded.
Any other traffic (assumed to be internet-bound) will be masqueraded.
Examples of local destinations for a Pod are its node's IP address, another node's address, or an IP address within the cluster's IP range.
By default, any other traffic will be masqueraded. The entries below show the default set of rules applied by the ip-masq-agent:
iptables -t nat -L IP-MASQ-AGENT
RETURN all -- anywhere 169.254.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 10.0.0.0/8 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 172.16.0.0/12 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 192.168.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
MASQUERADE all -- anywhere anywhere /* ip-masq-agent: outbound traffic should be subject to MASQUERADE (this match must come after cluster-local CIDR matches) */ ADDRTYPE match dst-type !LOCAL
iptables -t nat -L IP-MASQ-AGENT
Chain IP-MASQ-AGENT (1 references)
target prot opt source destination
RETURN all -- anywhere 169.254.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 10.0.0.0/8 /* ip-masq-agent: cluster-local
MASQUERADE all -- anywhere anywhere /* ip-masq-agent: outbound traffic should be subject to MASQUERADE (this match must come after cluster-local CIDR matches) */ ADDRTYPE match dst-type !LOCAL
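A sketch of a ConfigMap overriding the agent's defaults (the CIDR choices and interval are illustrative); the agent expects this data under a config key that is mounted at /etc/config/ip-masq-agent:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
      - 10.0.0.0/8        # traffic to these ranges is NOT masqueraded
      - 192.168.0.0/16
    masqLinkLocal: false   # keep 169.254.0.0/16 non-masqueraded
    resyncInterval: 60s    # how often the agent reloads this file
```

After the agent's next resync, traffic to any destination outside the listed CIDRs would be masqueraded behind the node IP.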
Cloud information about nodes in the cluster will no longer be retrieved from local metadata; instead, all API calls to retrieve node information
will go through the cloud controller manager. This means you can improve security by restricting the kubelet's access to cloud APIs.
In larger clusters you may want to consider whether the cloud controller manager will hit rate limits,
since it is now responsible for almost all of the cluster's API calls to the cloud.
The cloud controller manager can implement:
Node controller - responsible for updating Kubernetes nodes using cloud APIs and deleting Kubernetes nodes that were deleted in the cloud.
Service controller - responsible for provisioning cloud load balancers for Services of type LoadBalancer.
# This is an example of how to setup cloud-controller-manager as a Daemonset in your cluster.
# It assumes that your masters can run pods and has the role node-role.kubernetes.io/master
# Note that this Daemonset will not work straight out of the box for your cloud, this is
# meant to be a guideline.
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloud-controller-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:cloud-controller-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: cloud-controller-manager
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cloud-controller-manager
  template:
    metadata:
      labels:
        k8s-app: cloud-controller-manager
    spec:
      serviceAccountName: cloud-controller-manager
      containers:
      - name: cloud-controller-manager
        # for in-tree providers we use k8s.gcr.io/cloud-controller-manager
        # this can be replaced with any other image for out-of-tree providers
        image: k8s.gcr.io/cloud-controller-manager:v1.8.0
        command:
        - /usr/local/bin/cloud-controller-manager
        - --cloud-provider=[YOUR_CLOUD_PROVIDER]  # Add your own cloud provider here!
        - --leader-elect=true
        - --use-service-account-credentials
        # these flags will vary for every cloud provider
        - --allocate-node-cidrs=true
        - --configure-cloud-routes=true
        - --cluster-cidr=172.17.0.0/16
      tolerations:
      # this is required so CCM can bootstrap itself
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      # this is to have the daemonset runnable on master nodes
      # the taint may vary depending on your cluster setup
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      # this is to restrict CCM to only run on master nodes
      # the node selector may vary depending on your cluster setup
      nodeSelector:
        node-role.kubernetes.io/master: ""
Note: client-go defines its own API objects, so if needed, import API definitions from client-go rather than from the main repository.
For example, import "k8s.io/client-go/kubernetes" is correct.
The Go client can use the same
kubeconfig file
as the kubectl command-line tool to locate and authenticate to the API server. See this
example:
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// uses the current context in kubeconfig
	// path-to-kubeconfig -- for example, /root/.kube/config
	config, _ := clientcmd.BuildConfigFromFlags("", "<path-to-kubeconfig>")
	// creates the clientset
	clientset, _ := kubernetes.NewForConfig(config)
	// access the API to list pods
	pods, _ := clientset.CoreV1().Pods("").List(context.TODO(), v1.ListOptions{})
	fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
}
The Python client can use the same
kubeconfig file
as the kubectl command-line tool to locate and authenticate to the API server. See this
example:
from kubernetes import client, config

config.load_kube_config()

v1 = client.CoreV1Api()
print("Listing pods with their IPs:")
ret = v1.list_pod_for_all_namespaces(watch=False)
for i in ret.items:
    print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
The Java client can use the same
kubeconfig file
as the kubectl command-line tool to locate and authenticate to the API server.
See this example:
package io.kubernetes.client.examples;

import io.kubernetes.client.ApiClient;
import io.kubernetes.client.ApiException;
import io.kubernetes.client.Configuration;
import io.kubernetes.client.apis.CoreV1Api;
import io.kubernetes.client.models.V1Pod;
import io.kubernetes.client.models.V1PodList;
import io.kubernetes.client.util.ClientBuilder;
import io.kubernetes.client.util.KubeConfig;
import java.io.FileReader;
import java.io.IOException;

/**
 * A simple example of how to use the Java API from an application outside a kubernetes cluster
 *
 * <p>Easiest way to run this: mvn exec:java
 * -Dexec.mainClass="io.kubernetes.client.examples.KubeConfigFileClientExample"
 */
public class KubeConfigFileClientExample {
  public static void main(String[] args) throws IOException, ApiException {
    // file path to your KubeConfig
    String kubeConfigPath = "~/.kube/config";

    // loading the out-of-cluster config, a kubeconfig from file-system
    ApiClient client =
        ClientBuilder.kubeconfig(KubeConfig.loadKubeConfig(new FileReader(kubeConfigPath))).build();

    // set the global default api-client to the one configured above
    Configuration.setDefaultApiClient(client);

    // the CoreV1Api loads default api-client from global configuration.
    CoreV1Api api = new CoreV1Api();

    // invokes the CoreV1Api client
    V1PodList list =
        api.listPodForAllNamespaces(null, null, null, null, null, null, null, null, null);
    System.out.println("Listing all pods: ");
    for (V1Pod item : list.getItems()) {
      System.out.println(item.getMetadata().getName());
    }
  }
}
Pod definitions contain a security context
that can describe a request to access a node as a specific Linux user (such as root), run privileged, or access the host network,
along with other controls that would otherwise allow it to run unfettered on the host node.
Pod security policies
can limit which users or service accounts may provide dangerous security context settings.
For example, Pod security policies can limit volume mounts, especially hostpath, which are aspects of a Pod that should be controlled.
The commands above create a Deployment running a single nginx instance and expose it via a Service named nginx.
Both the nginx Pod and the Deployment are found in the default
namespace.
kubectl get svc,pod
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/kubernetes 10.100.0.1 <none> 443/TCP 46m
svc/nginx 10.100.0.16 <none> 80/TCP 33s
NAME READY STATUS RESTARTS AGE
po/nginx-701339712-e0qfq 1/1 Running 0 35s
Test the service by accessing it from a Pod
You should be able to access the new nginx service from other Pods.
To access the service from another Pod in the default namespace, start a busybox container:
kubectl run busybox --rm -ti --image=busybox /bin/sh
In your shell, run the following command:
wget --spider --timeout=1 nginx
Connecting to nginx (10.100.0.16:80)
remote file exists
By default, the kubelet uses CFS quota
to enforce Pod CPU limits.
When the node runs many CPU-bound Pods, the workload can move to different CPU cores,
depending on whether the Pod is throttled at scheduling time and which CPU cores are available.
Many workloads are not sensitive to this migration and thus work fine without any intervention.
However, some workloads' performance is markedly affected by CPU cache affinity and scheduling latency.
For these, the kubelet provides optional CPU management policies to determine some placement preferences on the node.
Configuration
The CPU management policy is specified with the kubelet flag --cpu-manager-policy. Two policies are supported:
none: the default policy, which represents the existing scheduling behavior.
static: allows Pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.
The CPU manager periodically writes resource updates through the CRI in order to keep in-memory CPU assignments consistent with cgroupfs.
The reconcile frequency is set through a new kubelet configuration value --cpu-manager-reconcile-period.
If not specified, it defaults to the same duration as --node-status-update-frequency.
none policy
The none policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the OS scheduler does automatically.
Limits on CPU usage for Guaranteed pods
are enforced using CFS quota.
static policy
The static policy targets Guaranteed Pods with integer CPU requests; it allows the containers in
such Pods access to exclusive CPUs on the node. This exclusivity is enforced using the
cpuset cgroup controller.
Note: System services such as the container runtime and the kubelet itself can continue to run on these exclusive CPUs. The exclusivity only extends to other Pods.
Note: The CPU manager does not support offlining and onlining of CPUs at runtime. Also, if the set of online CPUs on the node changes,
the Pods on the node must be evicted and the CPU manager manually reset by deleting the state file cpu_manager_state
in the kubelet root directory.
This policy manages a shared pool of CPUs that initially contains all CPUs on the node. The number of exclusively
allocatable CPUs equals the total number of CPUs on the node minus any CPU reservations made via the --kube-reserved or --system-reserved flags. Starting from version 1.17, the CPU reservation list can be specified explicitly with the kubelet --reserved-cpus flag.
The explicit CPU list specified by --reserved-cpus takes precedence over the reserved CPUs specified by --kube-reserved and --system-reserved. CPUs reserved by these flags are taken, in integer quantity, from the initial shared pool in ascending order by physical core ID. The shared pool is the set of CPUs on which BestEffort and Burstable Pods run. Containers in Guaranteed Pods with fractional CPU requests also run on CPUs in the shared pool. Only containers that are part of a Guaranteed Pod and have integer CPU requests are assigned exclusive CPUs.
Note: When enabling the static policy, it is required that --kube-reserved and/or --system-reserved or
--reserved-cpus result in a reserved CPU value greater than zero.
This is because a zero CPU reservation could allow the shared pool to become empty.
As Guaranteed Pods are scheduled to the node, if their containers fit the requirements for static assignment,
the corresponding CPUs are removed from the shared pool and placed in the container's cpuset.
CFS quota is not used to bound the CPU usage of these containers, as their usage is bound by the scheduling domain itself.
In other words, the number of CPUs in the container's cpuset equals the integer CPU limit specified in the Pod spec.
This static assignment increases CPU affinity and decreases context switches due to throttling for CPU-bound workloads.
Consider the containers in the following Pod spec:
spec:
  containers:
  - name: nginx
    image: nginx
This Pod runs in the BestEffort QoS class because no requests or limits are specified.
Therefore, the container runs in the shared CPU pool.
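By contrast, a sketch of a Pod spec that would receive exclusive CPUs under the static policy (the resource amounts are illustrative; what matters is that requests equal limits and the CPU quantity is an integer, making the Pod Guaranteed):

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"          # integer CPU, requests == limits → 2 exclusive CPUs
      requests:
        memory: "200Mi"
        cpu: "2"
```

A container requesting, say, cpu: "1.5" would still be Guaranteed but, because the CPU quantity is fractional, would run in the shared pool instead.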
Before the Topology Manager was introduced, the CPU and Device Managers in Kubernetes made resource allocation decisions independently of each other.
This could result in undesirable allocations on multi-socket systems; performance- or latency-sensitive applications would suffer due to these undesirable allocations.
Undesirable in this case means, for example, CPUs and devices being allocated from different NUMA nodes, thus incurring additional latency.
Within this scope, the Topology Manager performs a number of sequential resource alignments,
i.e., a separate alignment is computed for each container (in a Pod).
In other words, within this particular scope there is no notion of grouping containers to a specific set of NUMA nodes.
In effect, the Topology Manager performs an arbitrary alignment of individual containers to NUMA nodes.
The notion of grouping containers is implemented in the following scope, namely the pod scope.
Pod scope
To select the pod scope, start the kubelet with the command line option --topology-manager-scope=pod.
This scope allows grouping all containers in a Pod to a common set of NUMA nodes. That is,
the Topology Manager treats a Pod as a whole
and attempts to allocate the entire Pod (all containers) to either a single NUMA node or a common set of NUMA nodes.
The following examples illustrate the alignments the Topology Manager produces in different scenarios:
all containers can be allocated to a single NUMA node;
all containers can be allocated to a shared set of NUMA nodes.
The total amount of a particular resource demanded for the entire Pod is calculated according to the
effective requests/limits
formula,
and thus, for a given resource, this total value equals the maximum of:
the sum of all app container requests;
the maximum of init container requests.
Using the pod scope in tandem with the single-numa-node Topology Manager policy
is particularly valuable for workloads that are latency-sensitive, or for high-throughput applications that perform IPC.
By combining both options, you can place all containers of a Pod onto a single NUMA node,
eliminating inter-NUMA communication overhead for that Pod.
Under the single-numa-node policy, a Pod is accepted only if a suitable NUMA node set exists among the possible allocations.
Reconsider the example above:
if the node set contains only a single NUMA node, the Pod is admitted,
whereas if the node set contains more NUMA nodes, the Pod is rejected
(because satisfying the allocation would require two or more NUMA nodes instead of one).
To recap, the Topology Manager first computes a set of NUMA nodes and then tests it against the Topology Manager policy,
which leads to either rejection or admission of the Pod.
Note: If the Topology Manager is configured with the pod scope,
then the container considered by the policy reflects the requirements of the entire Pod,
and thus each container in the Pod will receive the same topology alignment decision.
none policy
This is the default policy and does not perform any topology alignment.
best-effort policy
For each container in a Guaranteed Pod, the kubelet, with the best-effort topology management policy,
calls each Hint Provider to discover its resource availability.
Using this information, the Topology Manager stores the preferred NUMA node affinity for that container.
If the affinity is not preferred, the Topology Manager stores this affinity and admits the Pod to the node anyway.
The Hint Providers can then use this information when making resource allocation decisions.
restricted policy
For each container in a Guaranteed Pod, the kubelet, with the restricted topology management policy,
calls each Hint Provider to discover its resource availability.
Using this information, the Topology Manager stores the preferred NUMA node affinity for that container.
If the affinity is not preferred, the Topology Manager rejects the Pod from the node.
This results in the Pod entering a Terminated state, with a Pod admission failure.
Once the Pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule it.
It is recommended to use a ReplicaSet or Deployment to trigger a redeployment of the Pod.
An external control loop could also be implemented to trigger a redeployment of Pods that have the Topology Affinity error.
If the Pod is admitted to run on the node, the Hint Providers can then use this information when making resource allocation decisions.
single-numa-node policy
For each container in a Guaranteed Pod, the kubelet, with the single-numa-node topology management policy,
calls each Hint Provider to discover its resource availability.
Using this information, the Topology Manager determines whether a single-NUMA-node affinity is possible.
If it is, the Topology Manager stores this information, and the Hint Providers can then use it when making resource allocation decisions.
If it is not possible, the Topology Manager rejects the Pod from the node.
This results in the Pod entering a Terminated state, with a Pod admission failure.
Once the Pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule it.
It is recommended to use a ReplicaSet or Deployment to redeploy the Pod.
An external control loop could also be implemented to trigger a redeployment of Pods that have the Topology Affinity error.
Pod interactions with Topology Manager policies
Consider the containers in the following Pod spec:
spec:
  containers:
  - name: nginx
    image: nginx
This Pod runs in the BestEffort QoS class because no resource requests or limits are specified.
When the second control-plane replica is started, a load balancer containing the two replicas will be created,
and the IP address of the first replica will be promoted to the IP address of the load balancer.
Similarly, after removal of the penultimate control-plane replica, the load balancer will be removed
and its IP address assigned to the last remaining replica.
Please note that creation and removal of load balancers are complex operations and may take some time (about 20 minutes) to propagate.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-autoscaler
  namespace: kube-system
  labels:
    k8s-app: dns-autoscaler
spec:
  selector:
    matchLabels:
      k8s-app: dns-autoscaler
  template:
    metadata:
      labels:
        k8s-app: dns-autoscaler
    spec:
      containers:
      - name: autoscaler
        image: k8s.gcr.io/cluster-proportional-autoscaler-amd64:1.6.0
        resources:
          requests:
            cpu: 20m
            memory: 10Mi
        command:
        - /cluster-proportional-autoscaler
        - --namespace=kube-system
        - --configmap=dns-autoscaler
        - --target=<SCALE_TARGET>
        # When cluster is using large nodes(with more cores), "coresPerReplica" should dominate.
        # If using small nodes, "nodesPerReplica" should dominate.
        - --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":16,"min":1}}
        - --logtostderr=true
        - --v=2
Kubernetes master is running at https://104.197.5.247
elasticsearch-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy
kibana-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kibana-logging/proxy
kube-dns is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kube-dns/proxy
grafana is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
heapster is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/monitoring-heapster/proxy
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default
---- -------- --- --- ---
Container cpu - - 100m
Note that these details show both resource quota (if present) and resource limit ranges.
Resource quota tracks aggregate usage of resources in a Namespace and allows cluster operators to define Hard resource usage limits that a Namespace may consume.
kubelet 将尝试在驱逐终端用户 pod 前回收节点层级资源。
发现磁盘压力时,如果节点针对容器运行时配置有独占的 imagefs,kubelet回收节点层级资源的方式将会不同。
With imagefs
If the nodefs filesystem has met its eviction threshold, the kubelet frees disk space by evicting pods and their containers.
If the imagefs filesystem has met its eviction threshold, the kubelet frees disk space by deleting all unused images.
Without imagefs
If the nodefs filesystem has met its eviction threshold, the kubelet frees disk space in the following order:
Delete stopped pods/containers
Delete all unused images
Evict end-user pods
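The thresholds that trigger this reclamation are configurable. As a minimal sketch (the values here are illustrative, not recommendations), hard eviction thresholds can be set in a kubelet configuration file:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard thresholds: the kubelet starts reclaiming/evicting as soon as one is crossed.
evictionHard:
  memory.available: "100Mi"   # illustrative values, tune for your nodes
  nodefs.available: "10%"
  imagefs.available: "15%"
```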
If the kubelet cannot reclaim sufficient resources on the node, it begins evicting pods.
The kubelet ranks pods for eviction first by whether their usage of the starved resource exceeds their requests,
then by priority,
and then by their consumption of the starved compute resource relative to their scheduling requests.
The kubelet ranks pods for eviction as follows:
BestEffort or Burstable pods whose usage of the starved resource exceeds their requests; such pods are ordered by priority, and then by how far their usage exceeds their requests.
Guaranteed pods and Burstable pods whose usage is below their requests are evicted last.
A pod is Guaranteed only when requests and limits are specified for all of its containers and they are equal.
Such pods are guaranteed never to be evicted because of another pod's resource consumption.
If system daemons (such as the kubelet, docker, and journald) consume more resources than were reserved via the
system-reserved or kube-reserved allocations, and the node only has Guaranteed or
Burstable pods using less than their requests remaining, then the node must choose to evict such a pod
to preserve node stability and to limit the impact of the unexpected consumption on other pods.
In this case, it evicts the lowest-priority pods first.
When necessary, the kubelet evicts pods one at a time to reclaim disk space when it encounters DiskPressure.
If the kubelet is responding to an inode shortage, it reclaims inodes by evicting pods with the lowest quality of service first.
If the kubelet is responding to a lack of available disk, it ranks pods within a quality-of-service class by their disk consumption and evicts the heaviest consumers first.
With imagefs
If nodefs triggered the eviction, the kubelet sorts pods by their nodefs usage: local volumes plus the logs of all of the pod's containers.
If imagefs triggered the eviction, the kubelet sorts pods by the usage of all of their writable layers.
Without imagefs
If nodefs triggered the eviction, the kubelet sorts pods by their total disk usage: local volumes plus the logs and writable layers of all containers.
Minimum eviction reclaim
In some scenarios, evicting a pod reclaims only a small amount of resource. This can cause the kubelet to hit the eviction threshold again and again. In addition, evicting resources such as disk is time-consuming.
kubectl get pod memory-demo-2 --namespace=mem-example
The output shows the container being killed, restarted, killed again, restarted again, and so on:
kubectl get pod memory-demo-2 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-2 0/1 OOMKilled 1 37s
kubectl get pod memory-demo-2 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-2 1/1 Running 2 40s
View detailed information about the Pod's history:
kubectl describe pod memory-demo-2 --namespace=mem-example
The output shows that the container starts and fails repeatedly:
... Normal Created Created container with id 66a3a20aa7980e61be4922780bf9d24d1a1d8b7395c09861225b0eba1b1f8511
... Warning BackOff Back-off restarting failed container
View detailed information about your cluster's nodes:
kubectl describe nodes
The output includes a record of the container from this exercise being killed because it ran out of memory:
Warning OOMKilling Memory cgroup out of memory: Kill process 4481 (stress) score 1994 or sacrifice child
Delete the Pod:
kubectl delete pod memory-demo-2 --namespace=mem-example
A memory request that exceeds the capacity of any node
Memory requests and limits are associated with containers, but it is useful to think of a Pod as having a memory request and limit as well.
A Pod's memory request is the sum of the memory requests of all of its containers.
Likewise, a Pod's memory limit is the sum of the memory limits of all of its containers.
Pod scheduling is based on requests. A Pod is scheduled onto a node only if the node has enough available memory to satisfy the Pod's memory request.
In this exercise, you create a Pod whose memory request exceeds the capacity of any node in your cluster.
Here is the configuration file for a Pod with one container that requests 1000 GiB of memory, which should exceed the capacity of any node in your cluster.
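As a sketch of such a configuration (the pod and container names follow the naming pattern of the surrounding exercise; the image is an assumption, since the Pod never actually runs):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-3
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-3-ctr
    image: polinux/stress    # illustrative; the Pod will never be scheduled
    resources:
      requests:
        memory: "1000Gi"     # far beyond any node's capacity
      limits:
        memory: "1000Gi"
```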
kubectl get pod memory-demo-3 --namespace=mem-example
The output shows that the Pod's status is Pending.
That is, the Pod has not been scheduled onto any node, and it will remain in that state indefinitely:
kubectl get pod memory-demo-3 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-3 0/1 Pending 0 25s
View detailed information about the Pod, including events:
kubectl describe pod memory-demo-3 --namespace=mem-example
The output shows that the container cannot be scheduled because the nodes do not have enough memory:
Events:
... Reason Message
------ -------
... FailedScheduling No nodes are available that match all of the following predicates:: Insufficient memory (3).
This page shows how to configure Group Managed Service Accounts (GMSA)
for Pods and containers that will run on Windows nodes.
Group Managed Service Accounts are a specific type of Active Directory account that provides
automatic password management, simplified Service Principal Name (SPN) management,
and the ability to delegate management to other administrators across multiple servers.
In Kubernetes, GMSA credential specs are configured as cluster-scoped Custom Resources.
Windows Pods, as well as individual containers within a Pod, can be configured to
use a GMSA for domain-based operations (for example, Kerberos authentication) when
interacting with other Windows services. As of Kubernetes 1.16, the Docker runtime
supports GMSA for Windows workloads.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be
configured to communicate with your cluster. The cluster is expected to have Windows worker nodes.
This section covers a set of initial steps required once for each cluster.
PS C:\> nltest /sc_reset:domain.example
Flags: 30 HAS_IP HAS_TIMESERV
Trusted DC Name \\dc10.domain.example
Trusted DC Connection Status Status=0 0x0 NERR_Success
The command completed successfully
PS C:\>
If the command above corrects the error, you can automate the step by adding a lifecycle
hook to your Pod spec. If it does not correct the error, you will need to examine the
credential spec again to make sure its data is correct and complete.
Conditions:
Type Status
PodScheduled False
...
Events:
...
... Warning FailedScheduling pod (extended-resource-demo-2) failed to fit in any node
fit failure summary on nodes : Insufficient example.com/dongle (1)
Check the status of the Pod:
kubectl get pod extended-resource-demo-2
The output shows that although the Pod was created, it was not scheduled onto a node. Its status is Pending:
NAME READY STATUS RESTARTS AGE
extended-resource-demo-2 0/1 Pending 0 6m
Clean up
Delete the Pods that you created for this exercise:
kubectl delete pod extended-resource-demo
kubectl delete pod extended-resource-demo-2
To specify security settings for a Pod, include the securityContext field in the Pod specification.
The securityContext field is a
PodSecurityContext
object. The security settings that you specify for a Pod apply to all containers in the Pod.
Here is a configuration file for a Pod that has a securityContext and an emptyDir volume:
Warning: After you assign an MCS label to a Pod, all Pods with the same label can access the volume.
If you need inter-Pod protection, you must assign a unique MCS label to each Pod.
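A sketch of assigning such an MCS label through the Pod-level securityContext (the pod name, image, and the specific label value are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 1000
    fsGroup: 2000
    seLinuxOptions:
      level: "s0:c123,c456"   # illustrative MCS label; applies to all containers and volumes
  volumes:
  - name: sec-ctx-vol
    emptyDir: {}
  containers:
  - name: sec-ctx-demo
    image: busybox
    command: ["sh", "-c", "sleep 1h"]
    volumeMounts:
    - name: sec-ctx-vol
      mountPath: /data/demo
```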
Clean up
Delete the Pods that you created:
kubectl delete pod security-context-demo
kubectl delete pod security-context-demo-2
kubectl delete pod security-context-demo-3
kubectl delete pod security-context-demo-4
When you (a human) access the cluster (for example, using kubectl), the API server authenticates you as a
particular user account (currently this is usually admin, unless your cluster administrator has customized your cluster).
Processes in containers inside Pods can also contact the API server.
When they do so, they are authenticated as a particular service account (for example, default).
In many cases the Kubernetes API server is not exposed on the public internet, but a user or
service provider may choose to expose a public endpoint that caches and serves API server responses.
In such an environment, it may be useful to override the jwks_uri in the OpenID provider
configuration so that it points to the public endpoint rather than the API server's address.
To do this, pass the --service-account-jwks-uri flag to the API server.
Like the issuer URL, the JWKS URI is required to use the https scheme.
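Sketched as API server flags (the endpoint URLs are assumptions for illustration):

```shell
kube-apiserver \
  --service-account-issuer=https://kubernetes.example.com \
  --service-account-jwks-uri=https://kubernetes.example.com/openid/v1/jwks   # public JWKS endpoint
```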
The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod
is considered ready only when all of its containers are ready.
One use of this signal is to control which Pods are used as backends for Services.
When a Pod is not ready, it is removed from Service load balancers.
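A minimal readiness probe sketch (the pod name, port, and /healthz path are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo
spec:
  containers:
  - name: app
    image: k8s.gcr.io/liveness
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz        # endpoint assumed to return 200 when ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```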
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
24s 24s 1 {default-scheduler } Normal Scheduled Successfully assigned liveness-exec to worker0
23s 23s 1 {kubelet worker0} spec.containers{liveness} Normal Pulling pulling image "k8s.gcr.io/busybox"
23s 23s 1 {kubelet worker0} spec.containers{liveness} Normal Pulled Successfully pulled image "k8s.gcr.io/busybox"
23s 23s 1 {kubelet worker0} spec.containers{liveness} Normal Created Created container with docker id 86849c15382e; Security:[seccomp=unconfined]
23s 23s 1 {kubelet worker0} spec.containers{liveness} Normal Started Started container with docker id 86849c15382e
After 35 seconds, view the Pod events again:
kubectl describe pod liveness-exec
At the bottom of the output, there are messages indicating that the liveness probe failed, and the container was killed and recreated.
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
37s 37s 1 {default-scheduler } Normal Scheduled Successfully assigned liveness-exec to worker0
36s 36s 1 {kubelet worker0} spec.containers{liveness} Normal Pulling pulling image "k8s.gcr.io/busybox"
36s 36s 1 {kubelet worker0} spec.containers{liveness} Normal Pulled Successfully pulled image "k8s.gcr.io/busybox"
36s 36s 1 {kubelet worker0} spec.containers{liveness} Normal Created Created container with docker id 86849c15382e; Security:[seccomp=unconfined]
36s 36s 1 {kubelet worker0} spec.containers{liveness} Normal Started Started container with docker id 86849c15382e
2s 2s 1 {kubelet worker0} spec.containers{liveness} Warning Unhealthy Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
Wait another 30 seconds, and verify that the container has been restarted:
kubectl get pod liveness-exec
The output shows that the RESTARTS count has been incremented:
NAME READY STATUS RESTARTS AGE
liveness-exec 1/1 Running 1 1m
Define a liveness HTTP request
Another kind of liveness probe uses an HTTP GET request.
Here is the configuration file for a Pod that runs a container based on the k8s.gcr.io/liveness image.
NAME STATUS AGE VERSION
worker0 Ready 1d v1.6.0+fff5156
worker1 Ready 1d v1.6.0+fff5156
worker2 Ready 1d v1.6.0+fff5156
Choose one of your nodes, and add a label to it:
kubectl label nodes <your-node-name> disktype=ssd
where <your-node-name> is the name of your chosen node.
Verify that your chosen node has the disktype=ssd label:
kubectl get nodes --show-labels
The output is similar to this:
NAME STATUS AGE VERSION LABELS
worker0 Ready 1d v1.6.0+fff5156 ...,disktype=ssd,kubernetes.io/hostname=worker0
worker1 Ready 1d v1.6.0+fff5156 ...,kubernetes.io/hostname=worker1
worker2 Ready 1d v1.6.0+fff5156 ...,kubernetes.io/hostname=worker2
In the preceding output, you can see that the worker0 node has a disktype=ssd label.
Create a pod that gets scheduled to your chosen node
This pod configuration file describes a pod that has a node selector, disktype: ssd. This means that the pod
will get scheduled on a node that has a disktype=ssd label.
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    volumeMounts:
    - name: workdir
      mountPath: /usr/share/nginx/html
  # These containers are run during pod initialization
  initContainers:
  - name: install
    image: busybox
    command:
    - wget
    - "-O"
    - "/work-dir/index.html"
    - http://info.cern.ch
    volumeMounts:
    - name: workdir
      mountPath: "/work-dir"
  dnsPolicy: Default
  volumes:
  - name: workdir
    emptyDir: {}
<html><head></head><body><header>
<title>http://info.cern.ch</title>
</header>
<h1>http://info.cern.ch - home of the first website</h1>
...
<li><a href="http://info.cern.ch/hypertext/WWW/TheProject.html">Browse the first website</a></li>
...
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
      preStop:
        exec:
          command: ["/bin/sh", "-c", "nginx -s quit; while killall -0 nginx; do sleep 1; done"]
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
  - name: test-container
    image: k8s.gcr.io/busybox
    command: ["/bin/sh", "-c", "env"]
    env:
    # Define the environment variable
    - name: SPECIAL_LEVEL_KEY
      valueFrom:
        configMapKeyRef:
          # The ConfigMap containing the value you want to assign to SPECIAL_LEVEL_KEY
          name: special-config
          # Specify the key associated with the value
          key: special.how
  restartPolicy: Never
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
  - name: test-container
    image: k8s.gcr.io/busybox
    command: ["/bin/sh", "-c", "ls /etc/config/"]
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
  volumes:
  - name: config-volume
    configMap:
      # Provide the name of the ConfigMap containing the files you want
      # to add to the container
      name: special-config
  restartPolicy: Never
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
0s 0s 1 dapi-test-pod Pod Warning InvalidEnvironmentVariableNames {kubelet, 127.0.0.1} Keys [1badkey, 2alsobad] from the EnvFrom configMap default/myconfig were skipped since they are considered invalid environment variable names.
A ConfigMap lives in a specific namespace.
A ConfigMap can only be referenced by Pods residing in the same namespace.
When the kubelet starts, it automatically starts all defined static Pods.
Since you have defined a static Pod and restarted the kubelet, the new static Pod should already be running.
You can view running containers (including static Pods) by running the following command on the node:
# Run this command on the node where the kubelet is running
docker ps
The output might be something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f6d05272b57e nginx:latest "nginx" 8 minutes ago Up 8 minutes k8s_web.6f802af4_static-web-fk-node1_default_67e24ed9466ba55986d120c867395f3c_378e5f3c
You can see the mirror Pod on the API server:
kubectl get pods
NAME READY STATUS RESTARTS AGE
static-web-my-node1 1/1 Running 0 2m
Note: Make sure the kubelet has permission to create mirror Pods in the API server. If not, the creation request is rejected by the API server.
See also Pod Security Policies.
INFO Kubernetes file "frontend-service.yaml" created
INFO Kubernetes file "redis-master-service.yaml" created
INFO Kubernetes file "redis-slave-service.yaml" created
INFO Kubernetes file "frontend-deployment.yaml" created
INFO Kubernetes file "redis-master-deployment.yaml" created
INFO Kubernetes file "redis-slave-deployment.yaml" created
service/frontend created
service/redis-master created
service/redis-slave created
deployment.apps/frontend created
deployment.apps/redis-master created
deployment.apps/redis-slave created
WARN Unsupported key networks - ignoring
WARN Unsupported key build - ignoring
INFO Kubernetes file "worker-svc.yaml" created
INFO Kubernetes file "db-svc.yaml" created
INFO Kubernetes file "redis-svc.yaml" created
INFO Kubernetes file "result-svc.yaml" created
INFO Kubernetes file "vote-svc.yaml" created
INFO Kubernetes file "redis-deployment.yaml" created
INFO Kubernetes file "result-deployment.yaml" created
INFO Kubernetes file "vote-deployment.yaml" created
INFO Kubernetes file "worker-deployment.yaml" created
INFO Kubernetes file "db-deployment.yaml" created
INFO Kubernetes file "frontend-service.yaml" created
INFO Kubernetes file "mlbparks-service.yaml" created
INFO Kubernetes file "mongodb-service.yaml" created
INFO Kubernetes file "redis-master-service.yaml" created
INFO Kubernetes file "redis-slave-service.yaml" created
INFO Kubernetes file "frontend-deployment.yaml" created
INFO Kubernetes file "mlbparks-deployment.yaml" created
INFO Kubernetes file "mongodb-deployment.yaml" created
INFO Kubernetes file "mongodb-claim0-persistentvolumeclaim.yaml" created
INFO Kubernetes file "redis-master-deployment.yaml" created
INFO Kubernetes file "redis-slave-deployment.yaml" created
WARN [worker] Service cannot be created because of missing port.
INFO OpenShift file "vote-service.yaml" created
INFO OpenShift file "db-service.yaml" created
INFO OpenShift file "redis-service.yaml" created
INFO OpenShift file "result-service.yaml" created
INFO OpenShift file "vote-deploymentconfig.yaml" created
INFO OpenShift file "vote-imagestream.yaml" created
INFO OpenShift file "worker-deploymentconfig.yaml" created
INFO OpenShift file "worker-imagestream.yaml" created
INFO OpenShift file "db-deploymentconfig.yaml" created
INFO OpenShift file "db-imagestream.yaml" created
INFO OpenShift file "redis-deploymentconfig.yaml" created
INFO OpenShift file "redis-imagestream.yaml" created
INFO OpenShift file "result-deploymentconfig.yaml" created
INFO OpenShift file "result-imagestream.yaml" created
WARN [foo] Service cannot be created because of missing port.
INFO OpenShift Buildconfig using git@github.com:rtnpro/kompose.git::master as source.
INFO OpenShift file "foo-deploymentconfig.yaml" created
INFO OpenShift file "foo-imagestream.yaml" created
INFO OpenShift file "foo-buildconfig.yaml" created
Kompose supports deploying your "composed" application directly to Kubernetes or OpenShift via kompose up.
Example: kompose up for Kubernetes
kompose --file ./examples/docker-guestbook.yml up
We are going to create Kubernetes deployments and services for your Dockerized application.
If you need different kind of resources, use the 'kompose convert' and 'kubectl create -f' commands instead.
INFO Successfully created service: redis-master
INFO Successfully created service: redis-slave
INFO Successfully created service: frontend
INFO Successfully created deployment: redis-master
INFO Successfully created deployment: redis-slave
INFO Successfully created deployment: frontend
Your application has been deployed to Kubernetes. You can run 'kubectl get deployment,svc,pods' for details.
kubectl get deployment,svc,pods
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.extensions/frontend 1 1 1 1 4m
deployment.extensions/redis-master 1 1 1 1 4m
deployment.extensions/redis-slave 1 1 1 1 4m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/frontend ClusterIP 10.0.174.12 <none> 80/TCP 4m
service/kubernetes ClusterIP 10.0.0.1 <none> 443/TCP 13d
service/redis-master ClusterIP 10.0.202.43 <none> 6379/TCP 4m
service/redis-slave ClusterIP 10.0.1.85 <none> 6379/TCP 4m
NAME READY STATUS RESTARTS AGE
pod/frontend-2768218532-cs5t5 1/1 Running 0 4m
pod/redis-master-1432129712-63jn8 1/1 Running 0 4m
pod/redis-slave-2504961300-nve7b 1/1 Running 0 4m
kompose --file ./examples/docker-guestbook.yml --provider openshift up
We are going to create OpenShift DeploymentConfigs and Services for your Dockerized application.
If you need different kind of resources, use the 'kompose convert' and 'oc create -f' commands instead.
INFO Successfully created service: redis-slave
INFO Successfully created service: frontend
INFO Successfully created service: redis-master
INFO Successfully created deployment: redis-slave
INFO Successfully created ImageStream: redis-slave
INFO Successfully created deployment: frontend
INFO Successfully created ImageStream: frontend
INFO Successfully created deployment: redis-master
INFO Successfully created ImageStream: redis-master
Your application has been deployed to OpenShift. You can run 'oc get dc,svc,is' for details.
oc get dc,svc,is
NAME REVISION DESIRED CURRENT TRIGGERED BY
dc/frontend 0 1 0 config,image(frontend:v4)
dc/redis-master 0 1 0 config,image(redis-master:e2e)
dc/redis-slave 0 1 0 config,image(redis-slave:v1)
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/frontend 172.30.46.64 <none> 80/TCP 8s
svc/redis-master 172.30.144.56 <none> 6379/TCP 8s
svc/redis-slave 172.30.75.245 <none> 6379/TCP 8s
NAME DOCKER REPO TAGS UPDATED
is/frontend 172.30.12.200:5000/fff/frontend
is/redis-master 172.30.12.200:5000/fff/redis-master
is/redis-slave 172.30.12.200:5000/fff/redis-slave v1
Once you have deployed your "composed" application to Kubernetes, kompose down
will help you remove the application by deleting its Deployment and Service objects.
If you need to remove other resources, use the kubectl command.
kompose --file docker-guestbook.yml down
INFO Successfully deleted service: redis-master
INFO Successfully deleted deployment: redis-master
INFO Successfully deleted service: redis-slave
INFO Successfully deleted deployment: redis-slave
INFO Successfully deleted service: frontend
INFO Successfully deleted deployment: frontend
INFO Build key detected. Attempting to build and push image 'docker.io/foo/bar'
INFO Building image 'docker.io/foo/bar' from directory 'build'
INFO Image 'docker.io/foo/bar' from directory 'build' built successfully
INFO Pushing image 'foo/bar:latest' to registry 'docker.io'
INFO Attempting authentication credentials 'https://index.docker.io/v1/'
INFO Successfully pushed image 'foo/bar:latest' to registry 'docker.io'
INFO We are going to create Kubernetes Deployments, Services and PersistentVolumeClaims for your Dockerized application. If you need different kind of resources, use the 'kompose convert' and 'kubectl create -f' commands instead.
INFO Deploying application in "default" namespace
INFO Successfully created Service: foo
INFO Successfully created Deployment: foo
Your application has been deployed to Kubernetes. You can run 'kubectl get deployment,svc,pods,pvc' for details.
INFO Kubernetes file "redis-svc.json" created
INFO Kubernetes file "web-svc.json" created
INFO Kubernetes file "redis-deployment.json" created
INFO Kubernetes file "web-deployment.json" created
The *-deployment.json files contain the Deployment objects.
kompose convert --replication-controller
INFO Kubernetes file "redis-svc.yaml" created
INFO Kubernetes file "web-svc.yaml" created
INFO Kubernetes file "redis-replicationcontroller.yaml" created
INFO Kubernetes file "web-replicationcontroller.yaml" created
INFO Kubernetes file "redis-svc.yaml" created
INFO Kubernetes file "web-svc.yaml" created
INFO Kubernetes file "redis-daemonset.yaml" created
INFO Kubernetes file "web-daemonset.yaml" created
INFO Kubernetes file "web-svc.yaml" created
INFO Kubernetes file "redis-svc.yaml" created
INFO Kubernetes file "web-deployment.yaml" created
INFO Kubernetes file "redis-deployment.yaml" created
chart created in "./docker-compose/"
kind: Deployment
metadata:
  annotations:
    # ...
    # This is the json representation of simple_deployment.yaml
    # It was written by kubectl apply when the object was created
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment",
      "metadata":{"annotations":{},"name":"nginx-deployment","namespace":"default"},
      "spec":{"minReadySeconds":5,"selector":{"matchLabels":{"app":"nginx"}},"template":{"metadata":{"labels":{"app":"nginx"}},
      "spec":{"containers":[{"image":"nginx:1.14.2","name":"nginx",
      "ports":[{"containerPort":80}]}]}}}}
  # ...
spec:
  # ...
  minReadySeconds: 5
  selector:
    matchLabels:
      # ...
      app: nginx
  template:
    metadata:
      # ...
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.14.2
        # ...
        name: nginx
        ports:
        - containerPort: 80
        # ...
      # ...
    # ...
  # ...
kubectl get -f https://k8s.io/examples/application/simple_deployment.yaml -o yaml
The output shows that the API server set several fields to default values in the live configuration.
These fields were not specified in the configuration file.
apiVersion: apps/v1
kind: Deployment
# ...
spec:
  selector:
    matchLabels:
      app: nginx
  minReadySeconds: 5
  replicas: 1             # defaulted by apiserver
  strategy:
    rollingUpdate:        # defaulted by apiserver - derived from strategy.type
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate   # defaulted by apiserver
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.14.2
        imagePullPolicy: IfNotPresent  # defaulted by apiserver
        name: nginx
        ports:
        - containerPort: 80
          protocol: TCP                # defaulted by apiserver
        resources: {}                  # defaulted by apiserver
        terminationMessagePath: /dev/termination-log  # defaulted by apiserver
      dnsPolicy: ClusterFirst          # defaulted by apiserver
      restartPolicy: Always            # defaulted by apiserver
      securityContext: {}              # defaulted by apiserver
      terminationGracePeriodSeconds: 30  # defaulted by apiserver
# ...
Sometimes an application running in a Pod needs to use configuration values that come from other objects.
For example, a Pod belonging to a Deployment may need to read the name of a
Service from an environment variable or a command-line argument.
Since the Service name may change as namePrefix or nameSuffix fields are added to the
kustomization.yaml file, it is not recommended to hard-code the Service name in the command arguments.
For this usage, Kustomize can inject the Service name into containers through vars.
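A sketch of such a vars declaration in kustomization.yaml (the resource file names, the var name, and the Service name are illustrative assumptions):

```yaml
resources:
- deployment.yaml
- service.yaml
nameSuffix: -prod          # the Service's final name is no longer just "backend"
vars:
- name: BACKEND_SERVICE_NAME   # referenced as $(BACKEND_SERVICE_NAME) in container args
  objref:
    kind: Service
    name: backend              # pre-rename name; Kustomize substitutes the final name
    apiVersion: v1
```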
"io.k8s.api.core.v1.PodSpec": {
..."containers": {
"description": "List of containers belonging to the pod. ...
},
"x-kubernetes-patch-merge-key": "name",
"x-kubernetes-patch-strategy": "merge"
},
"io.k8s.api.apps.v1.DeploymentSpec": {
..."strategy": {
"$ref": "#/definitions/io.k8s.api.apps.v1.DeploymentStrategy",
"description": "The deployment strategy to use to replace existing pods with new ones.",
"x-kubernetes-patch-strategy": "retainKeys"
},
apiVersion: v1
kind: Pod
metadata:
  name: envar-demo
  labels:
    purpose: demonstrate-envars
spec:
  containers:
  - name: envar-demo-container
    image: gcr.io/google-samples/node-hello:1.0
    env:
    - name: DEMO_GREETING
      value: "Hello from the environment"
    - name: DEMO_FAREWELL
      value: "Such a sweet sorrow"
NAME READY STATUS RESTARTS AGE
envar-demo 1/1 Running 0 9s
List the environment variables in the Pod's container:
kubectl exec envar-demo -- printenv
The output should be:
NODE_VERSION=4.4.2
EXAMPLE_SERVICE_PORT_8080_TCP_ADDR=10.3.245.237
HOSTNAME=envar-demo
...
DEMO_GREETING=Hello from the environment
DEMO_FAREWELL=Such a sweet sorrow
Environment variables that you define in a Pod's configuration can be used elsewhere in the configuration,
for example in the commands and arguments that you set for the Pod's containers.
In the example configuration below, the GREETING, HONORIFIC, and NAME environment variables are set to Warm greetings to,
The Most Honorable, and Kubernetes, respectively. Those environment variables are then used in the CLI arguments passed to the env-print-demo container.
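A sketch of the configuration just described, using the $(VAR_NAME) substitution syntax (the pod name and bash image are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: print-greeting
spec:
  containers:
  - name: env-print-demo
    image: bash
    env:
    - name: GREETING
      value: "Warm greetings to"
    - name: HONORIFIC
      value: "The Most Honorable"
    - name: NAME
      value: "Kubernetes"
    command: ["echo"]
    # $(VAR) is expanded by Kubernetes using the env entries above
    args: ["$(GREETING) $(HONORIFIC) $(NAME)"]
```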
apiVersion: v1
kind: Pod
metadata:
  name: secret-test-pod
spec:
  containers:
  - name: test-container
    image: nginx
    volumeMounts:
    # name must match the volume name below
    - name: secret-volume
      mountPath: /etc/secret-volume
  # The secret data is exposed to Containers in the Pod through a Volume.
  volumes:
  - name: secret-volume
    secret:
      secretName: test-secret
Create the Pod:
kubectl create -f secret-pod.yaml
Verify that the Pod is running:
kubectl get pod secret-test-pod
The output is:
NAME READY STATUS RESTARTS AGE
secret-test-pod 1/1 Running 0 42m
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2  # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.16.1  # Update the version of nginx from 1.14.2 to 1.16.1
        ports:
        - containerPort: 80
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 4  # Update the replicas from 2 to 4
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  ports:
  - port: 3306
  selector:
    app: mysql
  clusterIP: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - image: mysql:5.6
        name: mysql
        env:
        # Use secret in real usage
        - name: MYSQL_ROOT_PASSWORD
          value: password
        ports:
        - containerPort: 3306
          name: mysql
        volumeMounts:
        - name: mysql-persistent-storage
          mountPath: /var/lib/mysql
      volumes:
      - name: mysql-persistent-storage
        persistentVolumeClaim:
          claimName: mysql-pv-claim
The preceding YAML file creates a database Service that other Pods in the cluster can access. The
clusterIP: None option makes the Service DNS name resolve directly to the Pod's IP address.
This is optimal when you have only one Pod behind a Service and you do not intend to increase the number of Pods.
Run a MySQL client to connect to the server:
kubectl run -it --rm --image=mysql:5.6 --restart=Never mysql-client -- mysql -h mysql -ppassword
This command creates a new Pod in the cluster running a MySQL client and connects it to the server through the Service.
If it connects, you know your stateful MySQL database is up and running.
Waiting for pod default/mysql-client-274442439-zyp6i to be running, status is Pending, pod ready: false
If you don't see a command prompt, try pressing enter.
mysql>
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql
  labels:
    app: mysql
data:
  master.cnf: |
    # Apply this config only on the master.
    [mysqld]
    log-bin
  slave.cnf: |
    # Apply this config only on slaves.
    [mysqld]
    super-read-only
# Headless service for stable DNS entries of StatefulSet members.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  clusterIP: None
  selector:
    app: mysql
---
# Client service for connecting to any MySQL instance for reads.
# For writes, you must instead connect to the master: mysql-0.mysql.
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  selector:
    app: mysql
The script determines its own ordinal index by extracting it from the end of the Pod name, which is returned by the hostname command.
It then saves the ordinal (with a numeric offset to avoid reserved values) into a file called server-id.cnf in the MySQL conf.d directory.
This translates the unique, stable identity provided by the StatefulSet into the domain of
MySQL server IDs, which require the same uniqueness and stability guarantees.
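The ordinal-to-server-id translation the script performs can be sketched in a few lines of Python (the offset of 100 mirrors the convention of not using server-id 0, which MySQL reserves; treat the exact offset and helper name as assumptions):

```python
import re

def mysql_server_id(hostname: str, offset: int = 100) -> int:
    """Derive a stable MySQL server-id from a StatefulSet Pod hostname.

    StatefulSet Pods are named <statefulset-name>-<ordinal>, so the
    ordinal is the trailing integer of the hostname.
    """
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"not a StatefulSet hostname: {hostname!r}")
    # Offset the ordinal so server-id 0 (reserved by MySQL) is never used.
    return offset + int(match.group(1))

print(mysql_server_id("mysql-0"))  # → 100
print(mysql_server_id("mysql-2"))  # → 102
```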
The second Init container, named clone-mysql, performs a clone operation on replica Pods
the first time they start up with an empty PersistentVolume.
That means it copies all existing data from another running Pod, so that its local state is
consistent enough to begin replicating from the master.
MySQL itself does not provide a mechanism to do this, so the example uses a popular open-source tool called Percona XtraBackup.
During the clone, the performance of the source MySQL server may suffer.
To minimize the impact on the MySQL master, the script instructs each Pod to clone from the Pod whose ordinal index is one lower.
This works because the StatefulSet controller always ensures Pod N is Ready before starting Pod N+1.
Starting replication
After the Init containers complete successfully, the regular containers run.
The MySQL Pods consist of a mysql container that runs the actual mysqld server, and an
xtrabackup container that acts as a
sidecar.
The xtrabackup sidecar looks at the cloned data files and determines whether it is necessary to initialize MySQL replication on the replica.
If so, it waits for mysqld to be ready and then executes the
CHANGE MASTER TO and START SLAVE commands with replication parameters extracted from the XtraBackup clone files.
Once a replica begins replication, it remembers its MySQL master and reconnects
automatically if the server restarts or the connection is interrupted.
Also, because replicas look for the master at its stable DNS name (mysql-0.mysql),
they automatically find the master even if it gets a new Pod IP due to being rescheduled.
Lastly, after starting replication, the xtrabackup container listens for connections from other Pods requesting a data clone.
This server remains up indefinitely in case the StatefulSet scales up, or in case the
next Pod loses its PersistentVolumeClaim and needs to redo the clone.
Sending client traffic
You can send test queries to the MySQL master (hostname mysql-0.mysql)
by running a temporary container with the mysql:5.7 image and running the mysql client binary.
kubectl run mysql-client --image=mysql:5.7 -i --rm --restart=Never --\
mysql -h mysql-0.mysql <<EOF
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello');
EOF
Use the hostname mysql-read to send test queries to any server that reports being Ready:
kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never --\
mysql -h mysql-read -e "SELECT * FROM test.messages"
You should get output like this:
Waiting for pod default/mysql-client to be running, status is Pending, pod ready: false
+---------+
| message |
+---------+
| hello |
+---------+
pod "mysql-client" deleted
kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never --\
mysql -h mysql-3.mysql -e "SELECT * FROM test.messages"
Waiting for pod default/mysql-client to be running, status is Pending, pod ready: false
+---------+
| message |
+---------+
| hello |
+---------+
pod "mysql-client" deleted
For the above to result in graceful termination, the Pod must not set pod.Spec.TerminationGracePeriodSeconds to 0.
The practice of setting pod.Spec.TerminationGracePeriodSeconds to 0 seconds is unsafe and strongly discouraged for StatefulSet
Pods. Graceful deletion is safe and will ensure that the Pod shuts down gracefully
before the kubelet deletes the name from the API server.
Pods are not deleted automatically when a node is unreachable.
Pods running on an unreachable node enter the
'Terminating' or 'Unknown' state after a
timeout.
Pods may also enter these states when a user attempts graceful deletion of a Pod on an unreachable node.
The only ways a Pod in such a state can be removed from the API server are as follows:
Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler automatically scales the number of Pods in a ReplicationController, Deployment, ReplicaSet, or
StatefulSet based on observed CPU utilization.
In addition to CPU utilization, scaling can be driven by other, application-provided
custom metrics.
Horizontal Pod autoscaling does not apply to objects that cannot be scaled, such as DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller; the resource determines the behavior of the controller.
The controller periodically adjusts the number of replicas in a replication controller or Deployment so that
observed metrics such as average CPU utilization or average memory utilization match the target value set by the user.
How the Horizontal Pod Autoscaler works
The Horizontal Pod Autoscaler is implemented as a control loop whose period is set by the controller manager's --horizontal-pod-autoscaler-sync-period flag (15 seconds by default).
During each period, the controller manager queries resource utilization against the metrics specified in each HorizontalPodAutoscaler definition.
It obtains metrics from either the resource metrics API (for per-Pod resource metrics) or the custom metrics
API (for all other metrics).
For per-Pod resource metrics (such as CPU), the controller fetches the metrics from the resource metrics API for each
Pod targeted by the HorizontalPodAutoscaler. If a target utilization is set,
the controller takes the resource usage of the containers in each Pod and calculates the resource utilization.
If a raw target value is set, the raw metric values are used directly (no percentage is computed).
The controller then computes a scaling ratio from the average utilization or raw value, and from it the target replica count.
Note that if some of a Pod's containers do not expose the relevant resource metric, that Pod's CPU utilization is not used by the controller.
The algorithm details section below describes the algorithm in full.
For per-Pod custom metrics, the controller works much as it does for resource metrics, except that it operates on
raw values rather than utilization values.
For object metrics and external metrics (each of which describes a single object),
the metric is compared directly against the target value, producing a scaling ratio as above.
In the autoscaling/v2beta2 API version, this value can optionally be divided by the number of Pods before the comparison is made.
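The core of the scale calculation is the ratio just described. The documented rule is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), which can be sketched as:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """HPA scale rule: multiply the replica count by the ratio of the
    observed metric to its target, rounding up."""
    ratio = current_metric / target_metric
    return math.ceil(current_replicas * ratio)

# 3 replicas averaging 200m CPU against a 100m target scales up to 6.
print(desired_replicas(3, 200, 100))  # → 6
# Usage below the target shrinks the workload instead.
print(desired_replicas(6, 50, 100))   # → 3
```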
Support for scaling on multiple metrics was added in Kubernetes 1.6.
You can use the autoscaling/v2beta2 API version to specify multiple metrics for the Horizontal Pod Autoscaler.
The Horizontal Pod Autoscaler then evaluates each metric and proposes a new scale based on it.
The largest of the proposed scales is the one adopted.
Support for custom metrics
Note: Kubernetes 1.2 added alpha support for scaling based on application-specific metrics expressed with special annotations.
In Kubernetes 1.6, these annotations were deprecated in favor of the new autoscaling API.
While the old method for collecting custom metrics is still available, the Horizontal Pod Autoscaler no longer uses those metric values,
and it no longer honors the annotations previously used to specify which custom metrics to scale on.
As of Kubernetes 1.6, the Horizontal Pod Autoscaler supports the use of custom metrics.
You can specify custom metrics for the Horizontal Pod Autoscaler to use in the autoscaling/v2beta2 API.
Kubernetes then queries the custom metrics API to fetch the values of the appropriate custom metrics.
Name: cm-test
Namespace: prom
Labels: <none>
Annotations: <none>
CreationTimestamp: Fri, 16 Jun 2017 18:09:22 +0000
Reference: ReplicationController/cm-test
Metrics: ( current / target )
"http_requests" on pods: 66m / 500m
Min replicas: 1
Max replicas: 4
ReplicationController pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric http_requests
ScalingLimited False DesiredWithinRange the desired replica count is within the acceptable range
Events:
When running in a Pod, the Kubernetes API server is reachable through a Service named kubernetes in the default
namespace. That is, Pods can use the kubernetes.default.svc hostname
to query the API server. Official client libraries do this automatically.
The recommended way to authenticate to the API server is with a
service account credential.
By default, every Pod is associated with a service account, and a credential (token) for that service account is placed
into the filesystem tree of each container in the Pod, at /var/run/secrets/kubernetes.io/serviceaccount/token.
If available, a certificate bundle is placed into each container's filesystem tree at
/var/run/secrets/kubernetes.io/serviceaccount/ca.crt,
and should be used to verify the serving certificate of the API server.
Finally, the default namespace for namespaced API operations is placed in the file
/var/run/secrets/kubernetes.io/serviceaccount/namespace in each container.
Using kubectl proxy
If you want to query the API without an official client library, you can run kubectl proxy as the
command
of a sidecar container in the Pod. This way, kubectl proxy authenticates to the API
and exposes it on the Pod's localhost interface, so that other containers in the Pod
can use it directly.
Without a proxy
You can also avoid running kubectl proxy by passing the authentication token directly to the API server.
The internal certificate mechanism secures the connection.
# Point to the internal API server hostname
APISERVER=https://kubernetes.default.svc
# Path to the ServiceAccount token
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
# Read this Pod's namespace
NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
# Read the ServiceAccount bearer token
TOKEN=$(cat ${SERVICEACCOUNT}/token)
# Reference the internal certificate authority (CA)
CACERT=${SERVICEACCOUNT}/ca.crt
# Explore the API with the token
curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api
# Create a temporary interactive Pod
kubectl run -i --tty temp --image ubuntu:14.04
Waiting for pod default/temp-loe07 to be running, status is Pending, pod ready: false
... [ previous line repeats several times .. hit return when it stops ] ...
/usr/bin/amqp-declare-queue --url=$BROKER_URL -q job1 -d job1
for f in apple banana cherry date fig grape lemon melon
do
  /usr/bin/amqp-publish --url=$BROKER_URL -r job1 -p -b $f
done
#!/usr/bin/env python

# Just prints standard out and sleeps for 10 seconds.
import sys
import time
print("Processing " + sys.stdin.readlines()[0])
time.sleep(10)
#!/usr/bin/env python

import time
import rediswq

host = "redis"
# Uncomment next two lines if you do not have Kube-DNS working.
# import os
# host = os.getenv("REDIS_SERVICE_HOST")
q = rediswq.RedisWQ(name="job2", host=host)
print("Worker with sessionID: "+ q.sessionID())
print("Initial queue state: empty="+str(q.empty()))
while not q.empty():
item = q.lease(lease_secs=10, block=True, timeout=2)
if item is not None:
itemstr = item.decode("utf-8")
print("Working on "+ itemstr)
time.sleep(10) # Put your actual work here instead of sleep.
q.complete(item)
else:
print("Waiting for work")
print("Queue empty, exiting")
Services (optional): For some parts of your application (such as frontends) you may want to expose a
Service onto an external,
public IP address outside of your cluster (an external Service).
Note: For external Services, you may need to open up one or more ports.
Other Services, visible only from inside the cluster, are called internal Services.
For either kind of Service, if you choose to create one and your container listens on a port (incoming),
you need to specify two ports: the incoming port on which the Service is created, mapped to the target port seen by the container.
The Service routes traffic to your deployed Pods; both TCP and UDP are supported.
The Service's internal cluster DNS name will be the value of the application name you defined earlier.
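The two-port mapping just described looks like this in a Service manifest (names and port numbers are illustrative assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend           # becomes the Service's internal DNS name
spec:
  type: LoadBalancer       # external Service; omit for an internal one
  selector:
    app: frontend
  ports:
  - protocol: TCP
    port: 80               # incoming port the Service listens on
    targetPort: 8080       # port the container listens on
```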
Some clusters may allow you to ssh to a node; from there you may be able to access cluster services.
This is a non-standard method that works on some clusters but not others.
Browsers and other tools may or may not be installed, and cluster DNS may not work.
Discovering builtin services
Typically, there are several services that are started on a cluster by kube-system.
Get a list of these with the kubectl cluster-info command:
kubectl cluster-info
Kubernetes master is running at https://104.197.5.247
elasticsearch-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy
kibana-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kibana-logging/proxy
kube-dns is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kube-dns/proxy
grafana is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
heapster is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/monitoring-heapster/proxy
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mongo ClusterIP 10.96.41.183 <none> 27017/TCP 11s
Verify that the MongoDB server is running in the Pod, and listening on port 27017:
# Change mongo-75f59d57f4-4nd6q to the name of the Pod
kubectl get pod mongo-75f59d57f4-4nd6q --template='{{(index (index .spec.containers 0).ports 0).containerPort}}{{"\n"}}'
The output displays the port for MongoDB in that Pod:
27017
(this is the TCP port allocated to MongoDB on the internet).
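The Go template passed to kubectl above indexes into the Pod object; the same lookup in Python against the JSON form of a Pod is a one-liner. The trimmed-down Pod dict below is a hypothetical stand-in for what the API server returns:

```python
def first_container_port(pod):
    """Python equivalent of the Go template
    {{(index (index .spec.containers 0).ports 0).containerPort}}."""
    return pod["spec"]["containers"][0]["ports"][0]["containerPort"]

# A hypothetical, trimmed-down Pod object as returned by the API server:
pod = {"spec": {"containers": [{"name": "mongo",
                                "ports": [{"containerPort": 27017}]}]}}
```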
Forward a local port to a port on the Pod
kubectl port-forward allows using a resource name, such as a Pod name,
to select a matching Pod to port forward to.
# Change mongo-75f59d57f4-4nd6q to the name of the Pod
kubectl port-forward mongo-75f59d57f4-4nd6q 28015:27017
kubectl get pods --selector="run=load-balancer-example" --output=wide
The output is similar to:
NAME READY STATUS ... IP NODE
hello-world-2895499144-bsbk5 1/1 Running ... 10.200.1.4 worker1
hello-world-2895499144-m1pwt 1/1 Running ... 10.200.2.5 worker2
Get the public IP address of one of your nodes that is running a Hello World Pod. How you get this address depends on how you set up your cluster.
For example, if you are using Minikube, you can see the node address by running kubectl cluster-info.
If you are using Google Compute Engine instances, you can use the gcloud compute instances list command to see the public addresses of your nodes.
# The identifier Backend is internal to nginx, and used to name this specific upstream
upstream Backend {
# hello is the internal DNS name used by the backend Service inside Kubernetes
server hello;
}
server {
listen 80;
location / {
# The following statement will proxy traffic to the upstream named Backend
proxy_pass http://Backend;
}
}
Much like the backend, the frontend consists of a Deployment and a Service. An important difference
between the backend and frontend Services is that the frontend Service's configuration specifies
type: LoadBalancer, which means the Service uses your cloud provider's default load balancer and is therefore reachable from outside the cluster.
Specifically, if a Service has type LoadBalancer, the service controller attaches a finalizer named
service.kubernetes.io/load-balancer-cleanup.
The finalizer is only removed after the load balancer resource is cleaned up.
This prevents dangling load balancer resources even in corner cases such as the service controller crashing.
External load balancer providers
It is important to note that the datapath for this functionality is provided by a load balancer external to the Kubernetes cluster.
When the Service type is set to LoadBalancer, Kubernetes provides functionality to Pods within the cluster equivalent to
type ClusterIP, and extends it by programming the (external to Kubernetes) load balancer
with entries for the Kubernetes Pods.
The Kubernetes service controller automates the creation of the external load balancer, health checks (if needed), and firewall rules (if needed),
and retrieves the external IP allocated by the cloud provider and populates it in the Service object.
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
926f1b5a1d33a About a minute ago Ready sh-84d7dcf559-4r2gq default 0
4dccb216c4adb About a minute ago Ready nginx-65899c769f-wv2gp default 0
a86316e96fa89 17 hours ago Ready kube-proxy-gblk4 kube-system 0
919630b8f81f1 17 hours ago Ready nvidia-device-plugin-zgbbv kube-system 0
List the Pods by name:
crictl pods --name nginx-65899c769f-wv2gp
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
4dccb216c4adb 2 minutes ago Ready nginx-65899c769f-wv2gp default 0
List the Pods by label:
crictl pods --label run=nginx
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
4dccb216c4adb 2 minutes ago Ready nginx-65899c769f-wv2gp default 0
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
1f73f2d81bf98 busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47 7 minutes ago Running sh 1
9c5951df22c78 busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47 8 minutes ago Exited sh 0
87d3992f84f74 nginx@sha256:d0a8828cccb73397acb0073bf34f4d7d8aa315263f1e7806bf8c55d8ac139d5f 8 minutes ago Running nginx 0
1941fb4da154f k8s-gcrio.azureedge.net/hyperkube-amd64@sha256:00d814b1f7763f4ab5be80c58e98140dfc69df107f253d7fdd714b30a714260a 18 hours ago Running kube-proxy 0
List the running containers:
crictl ps
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
1f73f2d81bf98 busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47 6 minutes ago Running sh 1
87d3992f84f74 nginx@sha256:d0a8828cccb73397acb0073bf34f4d7d8aa315263f1e7806bf8c55d8ac139d5f 7 minutes ago Running nginx 0
1941fb4da154f k8s-gcrio.azureedge.net/hyperkube-amd64@sha256:00d814b1f7763f4ab5be80c58e98140dfc69df107f253d7fdd714b30a714260a 17 hours ago Running kube-proxy 0
...
"2: Mon Jan 1 00:01:02 UTC 2001\n"
"1: Mon Jan 1 00:01:01 UTC 2001\n"
"0: Mon Jan 1 00:01:00 UTC 2001\n"
...
"2: Mon Jan 1 00:00:02 UTC 2001\n"
"1: Mon Jan 1 00:00:01 UTC 2001\n"
"0: Mon Jan 1 00:00:00 UTC 2001\n"
apiVersion: audit.k8s.io/v1 # This is required.
kind: Policy
# Don't generate audit events for all requests in RequestReceived stage.
omitStages:
  - "RequestReceived"
rules:
  # Log pod changes at RequestResponse level
  - level: RequestResponse
    resources:
    - group: ""
      # Resource "pods" doesn't match requests to any subresource of pods,
      # which is consistent with the RBAC policy.
      resources: ["pods"]
  # Log "pods/log", "pods/status" at Metadata level
  - level: Metadata
    resources:
    - group: ""
      resources: ["pods/log", "pods/status"]
  # Don't log requests to a configmap called "controller-leader"
  - level: None
    resources:
    - group: ""
      resources: ["configmaps"]
      resourceNames: ["controller-leader"]
  # Don't log watch requests by the "system:kube-proxy" on endpoints or services
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
    resources:
    - group: "" # core API group
      resources: ["endpoints", "services"]
  # Don't log authenticated requests to certain non-resource URL paths.
  - level: None
    userGroups: ["system:authenticated"]
    nonResourceURLs:
    - "/api*" # Wildcard matching.
    - "/version"
  # Log the request body of configmap changes in kube-system.
  - level: Request
    resources:
    - group: "" # core API group
      resources: ["configmaps"]
    # This rule only applies to resources in the "kube-system" namespace.
    # The empty string "" can be used to select non-namespaced resources.
    namespaces: ["kube-system"]
  # Log configmap and secret changes in all other namespaces at the Metadata level.
  - level: Metadata
    resources:
    - group: "" # core API group
      resources: ["secrets", "configmaps"]
  # Log all other resources in core and extensions at the Request level.
  - level: Request
    resources:
    - group: "" # core API group
    - group: "extensions" # Version of group should NOT be included.
  # A catch-all rule to log all other requests at the Metadata level.
  - level: Metadata
    # Long-running requests like watches that fall under this rule will not
    # generate an audit event in RequestReceived.
    omitStages:
      - "RequestReceived"
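Audit policy rules are evaluated in order and the first matching rule determines the audit level of a request. The sketch below is a deliberate simplification that checks only users, verbs, and resource names; the real evaluator also matches groups, namespaces, resource names, stages, and non-resource URLs:

```python
def audit_level(rules, user, verb, resource, default="Metadata"):
    """First-match-wins sketch of audit policy evaluation. A rule field that
    is absent matches everything, just as in a real Policy object."""
    for rule in rules:
        if "users" in rule and user not in rule["users"]:
            continue
        if "verbs" in rule and verb not in rule["verbs"]:
            continue
        if "resources" in rule and resource not in rule["resources"]:
            continue
        return rule["level"]
    return default

# Condensed versions of three rules from the policy above:
rules = [
    {"level": "None", "users": ["system:kube-proxy"], "verbs": ["watch"],
     "resources": ["endpoints", "services"]},
    {"level": "RequestResponse", "resources": ["pods"]},
    {"level": "Metadata"},  # catch-all
]
```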
I0805 10:43:25.129850 46757 schema.go:126] unknown field: commnd
I0805 10:43:25.129973 46757 schema.go:129] this may be a false alarm, see https://github.com/kubernetes/kubernetes/issues/6842
pods/mypod
The next thing to check is whether the Pod on the API server matches the Pod you meant to create
(for example, from a YAML file on your local machine).
For example, run kubectl get pods/mypod -o yaml > mypod-on-apiserver.yaml and then
manually compare the original pod description, mypod.yaml, with the one you got back from the API server.
There will typically be some lines in the API-server version that are not in the original version; this is expected.
However, if there are lines in the original that are not in the API-server version, this may
indicate a problem with your Pod spec.
Another way to get extra information about a Pod (beyond kubectl describe pod and kubectl get pod)
is to pass the -o yaml output format flag to kubectl get pod.
This gives you, in YAML format, even more information than kubectl describe pod; in fact, all of the information the system has about the Pod.
Here you will see things like annotations (key-value metadata without the label restrictions,
used internally by Kubernetes system components), restart policy, ports, and volumes.
kubectl get pod nginx-deployment-1006230814-6winp -o yaml
I1027 22:14:53.995134 5063 server.go:200] Running in resource-only container "/kube-proxy"
I1027 22:14:53.998163 5063 server.go:247] Using iptables Proxier.
I1027 22:14:53.999055 5063 server.go:255] Tearing down userspace rules. Errors here are acceptable.
I1027 22:14:54.038140 5063 proxier.go:352] Setting endpoints for "kube-system/kube-dns:dns-tcp" to [10.244.1.3:53]
I1027 22:14:54.038164 5063 proxier.go:352] Setting endpoints for "kube-system/kube-dns:dns" to [10.244.1.3:53]
I1027 22:14:54.038209 5063 proxier.go:352] Setting endpoints for "default/kubernetes:https" to [10.240.0.2:443]
I1027 22:14:54.038238 5063 proxier.go:429] Not syncing iptables until Services and Endpoints have been received from master
I1027 22:14:54.040048 5063 proxier.go:294] Adding new service "default/kubernetes:https" at 10.0.0.1:443/TCP
I1027 22:14:54.040154 5063 proxier.go:294] Adding new service "kube-system/kube-dns:dns" at 10.0.0.10:53/UDP
I1027 22:14:54.040223 5063 proxier.go:294] Adding new service "kube-system/kube-dns:dns-tcp" at 10.0.0.10:53/TCP
for intf in /sys/devices/virtual/net/cbr0/brif/*; do cat $intf/hairpin_mode; done
1
1
1
1
If the effective hairpin mode is promiscuous-bridge, ensure that the Kubelet has permission
to manipulate Linux bridges on the node. If the cbr0 bridge is in use and set up correctly, you will see:
ifconfig cbr0 |grep PROMISC
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1460 Metric:1
If none of the above solves your problem, seek help.
Seeking help
If you get this far, something very strange is happening. Your Service is running, has Endpoints,
and your Pods are actually serving. Your DNS is working, iptables rules are installed, and kube-proxy does not seem to be misbehaving.
And yet your Service is still not working. Please let us know, so we can help investigate!
OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown
When you delete a CustomResourceDefinition, the server uninstalls its RESTful API
endpoint and deletes all custom objects stored in it.
kubectl delete -f resourcedefinition.yaml
kubectl get crontabs
Error from server (NotFound): Unable to list {"stable.example.com" "v1" "crontabs"}: the server could not find the requested resource (get crontabs.stable.example.com)
properties:
  foo:
    pattern: "abc"
metadata:
  type: object
  properties:
    name:
      type: string
      pattern: "^a"
finalizers:
  type: array
  items:
    type: string
    pattern: "my-finalizer"
anyOf:
- properties:
    bar:
      type: integer
      minimum: 42
  required: ["bar"]
  description: "foo bar object"
is not a structural schema because of the following violations:
the type at the root is missing (rule 1)
the type of foo is missing (rule 1)
bar inside of anyOf is not specified outside (rule 2)
bar's type is within anyOf (rule 3)
the description is set within anyOf (rule 3)
metadata.finalizers might not be restricted (rule 4)
In contrast, the following, corresponding schema is structural:
type: object
description: "foo bar object"
properties:
  foo:
    type: string
    pattern: "abc"
  bar:
    type: integer
metadata:
  type: object
  properties:
    name:
      type: string
      pattern: "^a"
anyOf:
- properties:
    bar:
      minimum: 42
  required: ["bar"]
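Rule 1 (every node must set a non-empty type) is easy to check mechanically. The sketch below walks a schema represented as nested dicts and reports nodes without a type; it is a partial check only, since rules 2 through 4 (about anyOf/allOf/not siblings and metadata restrictions) are not covered:

```python
def missing_types(schema, path="<root>"):
    """Report every schema node that does not set `type` (rule 1 only)."""
    problems = []
    if "type" not in schema:
        problems.append(path)
    for name, sub in schema.get("properties", {}).items():
        problems.extend(missing_types(sub, path + "." + name))
    return problems

# Condensed form of the non-structural example above:
non_structural = {
    "properties": {
        "foo": {"pattern": "abc"},
        "metadata": {"type": "object",
                     "properties": {"name": {"type": "string"}}},
    },
}
```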
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        # openAPIV3Schema is the schema for validating custom objects.
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cronSpec:
                  type: string
                  pattern: '^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$'
                image:
                  type: string
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 10
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames:
    - ct
The CronTab "my-new-cron-object" is invalid: []: Invalid value: map[string]interface {}{"apiVersion":"stable.example.com/v1", "kind":"CronTab", "metadata":map[string]interface {}{"name":"my-new-cron-object", "namespace":"default", "deletionTimestamp":interface {}(nil), "deletionGracePeriodSeconds":(*int64)(nil), "creationTimestamp":"2017-09-05T05:20:07Z", "uid":"e14d79e7-91f9-11e7-a598-f0761cb232d1", "clusterName":""}, "spec":map[string]interface {}{"cronSpec":"* * * *", "image":"my-awesome-cron-image", "replicas":15}}:
validation failure list:
spec.cronSpec in body should match '^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$'
spec.replicas in body should be less than or equal to 10
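The cronSpec pattern that produced the first failure above can be exercised locally before you submit an object. A quick check with Python's re module, using the exact pattern from the CRD:

```python
import re

# The same pattern the CRD declares for spec.cronSpec.
CRON_PATTERN = re.compile(r'^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$')

def valid_cron_spec(value):
    """True if the value would pass the CRD's cronSpec validation."""
    return CRON_PATTERN.match(value) is not None
```

"* * * *" fails because the pattern requires exactly five fields: one leading field plus four repetitions of the whitespace-separated group.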
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames:
    - ct
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              cronSpec:
                type: string
              image:
                type: string
              replicas:
                type: integer
    additionalPrinterColumns:
    - name: Spec
      type: string
      description: The cron spec defining the interval a CronJob is run
      jsonPath: .spec.cronSpec
    - name: Replicas
      type: integer
      description: The number of jobs launched by the CronJob
      jsonPath: .spec.replicas
    - name: Age
      type: date
      jsonPath: .metadata.creationTimestamp
Create the CustomResourceDefinition:
kubectl apply -f resourcedefinition.yaml
Create an instance using the my-crontab.yaml from the previous section.
Invoke the server-side printing:
kubectl get crontab my-new-cron-object
Notice the NAME, SPEC, REPLICAS, and AGE columns in the output:
NAME SPEC REPLICAS AGE
my-new-cron-object * * * * * 1 7s
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.example.com
spec:
  group: example.com
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      # This indicates the v1alpha1 version of the custom resource is deprecated.
      # API requests to this version receive a warning header in the server response.
      deprecated: true
      # This overrides the default warning returned to API clients making v1alpha1 API requests.
      deprecationWarning: "example.com/v1alpha1 CronTab is deprecated; see http://example.com/v1alpha1-v1 for instructions to migrate to example.com/v1 CronTab"
      schema: ...
    - name: v1beta1
      served: true
      # This indicates the v1beta1 version of the custom resource is deprecated.
      # API requests to this version receive a warning header in the server response.
      # A default warning message is returned for this version.
      deprecated: true
      schema: ...
    - name: v1
      served: true
      storage: true
      schema: ...
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.example.com
spec:
  group: example.com
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  scope: Namespaced
  validation: ...
  versions:
    - name: v1alpha1
      served: true
      # This indicates the v1alpha1 version of the custom resource is deprecated.
      # API requests to this version receive a warning header in the server response.
      deprecated: true
      # This overrides the default warning returned to API clients making v1alpha1 API requests.
      deprecationWarning: "example.com/v1alpha1 CronTab is deprecated; see http://example.com/v1alpha1-v1 for instructions to migrate to example.com/v1 CronTab"
    - name: v1beta1
      served: true
      # This indicates the v1beta1 version of the custom resource is deprecated.
      # API requests to this version receive a warning header in the server response.
      # A default warning message is returned for this version.
      deprecated: true
    - name: v1
      served: true
      storage: true
apiVersion: apiextensions.k8s.io/v1
kind: ConversionReview
response:
  uid: <value from request.uid>
  result:
    status: Failed
    message: hostPort could not be parsed into a separate host and port
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: ConversionReview
response:
  uid: <value from request.uid>
  result:
    status: Failed
    message: hostPort could not be parsed into a separate host and port
--requestheader-client-ca-file=<path to aggregator CA cert>
--requestheader-allowed-names=front-proxy-client
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--proxy-client-cert-file=<path to aggregator proxy cert>
--proxy-client-key-file=<path to aggregator proxy key>
If you are not running kube-proxy on a host running the API server,
you must make sure that the system is enabled with the following kube-apiserver flag:
--enable-aggregator-routing=true
Register APIService objects
You can dynamically configure which client requests are proxied to the extension apiserver. The following is an example registration:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: <name of the registration object>
spec:
  group: <API group name the extension apiserver hosts>
  version: <API version the extension apiserver hosts>
  groupPriorityMinimum: <priority the APIService's group has, see API documentation>
  versionPriority: <prioritizes ordering of this version within a group, see API documentation>
  service:
    namespace: <namespace of the extension apiserver service>
    name: <name of the extension apiserver service>
  caBundle: <PEM encoded CA cert that signs the server cert used by the webhook>
To make it easier to work through these examples, we did not verify that the Pods
were actually scheduled using the desired schedulers.
We can verify that by changing the order in which the Pod and Deployment configs are submitted above.
If we submit all the Pod configs to a Kubernetes cluster before submitting the scheduler Deployment config,
we see that the Pod annotated with annotation-second-scheduler remains in the "Pending" state forever,
while the other two Pods get scheduled.
Once we submit the scheduler Deployment config and our new scheduler starts running, the
annotation-second-scheduler Pod gets scheduled as well.
Alternatively, you can look at the "Scheduled" entries in the event logs to verify that the Pods were scheduled by the desired scheduler.
kubectl get events
You can also use a custom scheduler configuration
or a custom container image for the cluster's main scheduler by modifying its static Pod manifest on the relevant control plane nodes.
apiVersion: apiserver.k8s.io/v1beta1
kind: EgressSelectorConfiguration
egressSelections:
# Since we want to control the egress traffic to the cluster, we use the
# "cluster" as the name. Other supported values are "etcd", and "master".
- name: cluster
  connection:
    # This controls the protocol between the API Server and the Konnectivity
    # server. Supported values are "GRPC" and "HTTPConnect". There is no
    # end user visible difference between the two modes. You need to set the
    # Konnectivity server to work in the same mode.
    proxyProtocol: GRPC
    transport:
      # This controls what transport the API Server uses to communicate with the
      # Konnectivity server. UDS is recommended if the Konnectivity server
      # locates on the same machine as the API Server. You need to configure the
      # Konnectivity server to listen on the same UDS socket.
      # The other supported transport is "tcp". You will need to set up TLS
      # config to secure the TCP transport.
      uds:
        udsName: /etc/kubernetes/konnectivity-server/konnectivity-server.socket
apiVersion: v1
kind: Pod
metadata:
  name: konnectivity-server
  namespace: kube-system
spec:
  priorityClassName: system-cluster-critical
  hostNetwork: true
  containers:
  - name: konnectivity-server-container
    image: us.gcr.io/k8s-artifacts-prod/kas-network-proxy/proxy-server:v0.0.16
    command: ["/proxy-server"]
    args: [
      "--logtostderr=true",
      # This needs to be consistent with the value set in egressSelectorConfiguration.
      "--uds-name=/etc/kubernetes/konnectivity-server/konnectivity-server.socket",
      # The following two lines assume the Konnectivity server is
      # deployed on the same machine as the apiserver, and the certs and
      # key of the API Server are at the specified location.
      "--cluster-cert=/etc/kubernetes/pki/apiserver.crt",
      "--cluster-key=/etc/kubernetes/pki/apiserver.key",
      # This needs to be consistent with the value set in egressSelectorConfiguration.
      "--mode=grpc",
      "--server-port=0",
      "--agent-port=8132",
      "--admin-port=8133",
      "--health-port=8134",
      "--agent-namespace=kube-system",
      "--agent-service-account=konnectivity-agent",
      "--kubeconfig=/etc/kubernetes/konnectivity-server.conf",
      "--authentication-audience=system:konnectivity-server"
    ]
    livenessProbe:
      httpGet:
        scheme: HTTP
        host: 127.0.0.1
        port: 8134
        path: /healthz
      initialDelaySeconds: 30
      timeoutSeconds: 60
    ports:
    - name: agentport
      containerPort: 8132
      hostPort: 8132
    - name: adminport
      containerPort: 8133
      hostPort: 8133
    - name: healthport
      containerPort: 8134
      hostPort: 8134
    volumeMounts:
    - name: k8s-certs
      mountPath: /etc/kubernetes/pki
      readOnly: true
    - name: kubeconfig
      mountPath: /etc/kubernetes/konnectivity-server.conf
      readOnly: true
    - name: konnectivity-uds
      mountPath: /etc/kubernetes/konnectivity-server
      readOnly: false
  volumes:
  - name: k8s-certs
    hostPath:
      path: /etc/kubernetes/pki
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/konnectivity-server.conf
      type: FileOrCreate
  - name: konnectivity-uds
    hostPath:
      path: /etc/kubernetes/konnectivity-server
      type: DirectoryOrCreate
apiVersion: apps/v1
# Alternatively, you can deploy the agents as Deployments. It is not necessary
# to have an agent on each node.
kind: DaemonSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: konnectivity-agent
  namespace: kube-system
  name: konnectivity-agent
spec:
  selector:
    matchLabels:
      k8s-app: konnectivity-agent
  template:
    metadata:
      labels:
        k8s-app: konnectivity-agent
    spec:
      priorityClassName: system-cluster-critical
      tolerations:
        - key: "CriticalAddonsOnly"
          operator: "Exists"
      containers:
        - image: us.gcr.io/k8s-artifacts-prod/kas-network-proxy/proxy-agent:v0.0.16
          name: konnectivity-agent
          command: ["/proxy-agent"]
          args: [
            "--logtostderr=true",
            "--ca-cert=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
            # Since the konnectivity server runs with hostNetwork=true,
            # this is the IP address of the master machine.
            "--proxy-server-host=35.225.206.7",
            "--proxy-server-port=8132",
            "--admin-server-port=8133",
            "--health-server-port=8134",
            "--service-account-token-path=/var/run/secrets/tokens/konnectivity-agent-token"
          ]
          volumeMounts:
            - mountPath: /var/run/secrets/tokens
              name: konnectivity-agent-token
          livenessProbe:
            httpGet:
              port: 8134
              path: /healthz
            initialDelaySeconds: 15
            timeoutSeconds: 15
      serviceAccountName: konnectivity-agent
      volumes:
        - name: konnectivity-agent-token
          projected:
            sources:
              - serviceAccountToken:
                  path: konnectivity-agent-token
                  audience: system:konnectivity-server
Note: The files referenced by the kube-controller-manager flags --client-ca-file and --cluster-signing-cert-file
cannot be CA bundles. If these flags and --root-ca-file point to the same ca.crt bundle file (containing both the old and new CA certificates),
you will get an error.
To work around this problem, copy the new CA certificate to a separate file and point the --client-ca-file and --cluster-signing-cert-file
flags at the copy. Once ca.crt is no longer a bundle, you can restore the problem flags to point at ca.crt and delete the copy.
Update all ServiceAccount tokens to include both the old and new CA certificates.
If any Pods are started before the API server uses the new CA, those Pods
will also get this update and trust both the old and new CA certificates.
base64_encoded_ca="$(base64 -w0 <path to file containing both old and new CAs>)"

for namespace in $(kubectl get ns --no-headers | awk '{print $1}'); do
    for token in $(kubectl get secrets --namespace "$namespace" --field-selector type=kubernetes.io/service-account-token -o name); do
        kubectl get $token --namespace "$namespace" -o yaml | \
          /bin/sed "s/\(ca.crt:\).*/\1 ${base64_encoded_ca}/" | \
          kubectl apply -f -
    done
done
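The sed expression in the loop above rewrites the value of the ca.crt key in each serialized Secret. The same substitution in Python, which can be handy for dry-running the rewrite before piping anything back into kubectl apply (the function name is illustrative):

```python
import base64
import re

def replace_ca_crt(secret_yaml, ca_bundle_bytes):
    """Swap the value of the ca.crt key in a serialized Secret for the
    base64-encoded CA bundle, mirroring the sed expression above."""
    encoded = base64.b64encode(ca_bundle_bytes).decode("ascii")
    # "." does not match newlines, so only the ca.crt line is rewritten.
    return re.sub(r"(ca.crt:).*", r"\1 " + encoded, secret_yaml)
```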
Restart all Pods that use in-cluster configuration (for example, kube-proxy, coredns, etc.) so that they can pick up
the updated certificate authority data from the ServiceAccount Secrets.
Make sure coredns, kube-proxy, and other Pods using in-cluster configuration are working as expected.
Append both the old and new CAs to the files referenced by the --client-ca-file and --kubelet-certificate-authority flags in the kube-apiserver configuration.
Append both the old and new CAs to the file referenced by the --client-ca-file flag in the kube-scheduler configuration.
for namespace in $(kubectl get namespace -o jsonpath='{.items[*].metadata.name}'); do
    for name in $(kubectl get deployments -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
        kubectl patch deployment -n ${namespace} ${name} -p '{"spec":{"template":{"metadata":{"annotations":{"ca-rotation": "1"}}}}}';
    done
    for name in $(kubectl get daemonset -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
        kubectl patch daemonset -n ${namespace} ${name} -p '{"spec":{"template":{"metadata":{"annotations":{"ca-rotation": "1"}}}}}';
    done
done
where 192.0.2.24 is the Service's cluster IP, my-svc.my-namespace.svc.cluster.local
is the Service's DNS name, 10.0.34.2 is the Pod's IP, and
my-pod.my-namespace.pod.cluster.local is the Pod's DNS name.
You should see output like the following:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # this toleration is to have the daemonset runnable on master nodes
      # remove it if your masters can't run pods
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # this toleration is to have the daemonset runnable on master nodes
      # remove it if your masters can't run pods
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
#!/bin/bash
# optional argument handling
if [[ "$1" == "version" ]]
then
echo "1.0.0"
exit 0
fi
# optional argument handling
if [[ "$1" == "config" ]]
then
echo $KUBECONFIG
exit 0
fi
echo "I am a plugin named kubectl-foo"
# create a plugin with an underscore in its filename
echo -e '#!/bin/bash\n\necho "I am a plugin with a dash in my name"' > ./kubectl-foo_bar
sudo chmod +x ./kubectl-foo_bar
# move the plugin into your PATH
sudo mv ./kubectl-foo_bar /usr/local/bin
# you can now invoke the plugin via kubectl
kubectl foo-bar
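The naming convention at work here (dashes inside a command word map to underscores in the plugin filename, and kubectl probes PATH for the longest matching candidate first) can be approximated as follows. This is a sketch of the convention, not kubectl's exact lookup code:

```python
def plugin_candidates(args):
    """Candidate executable names kubectl would probe on PATH for the given
    command-line words, longest match first. Dashes inside a word become
    underscores; flags end the command-word sequence."""
    words = []
    for arg in args:
        if arg.startswith("-"):
            break  # flags are passed through to the plugin, not matched
        words.append(arg.replace("-", "_"))
    return ["kubectl-" + "-".join(words[:n]) for n in range(len(words), 0, -1)]
```

So `kubectl foo-bar` resolves to the `kubectl-foo_bar` executable created above.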
PATH=/usr/local/bin/plugins:/usr/local/bin/moreplugins kubectl plugin list
The following kubectl-compatible plugins are available:
/usr/local/bin/plugins/kubectl-foo
/usr/local/bin/moreplugins/kubectl-foo
- warning: /usr/local/bin/moreplugins/kubectl-foo is overshadowed by a similarly named plugin: /usr/local/bin/plugins/kubectl-foo
error: one plugin warning was found
You can use the kubectl plugin list command mentioned earlier to ensure that your plugin is visible to kubectl,
and verify that there are no warnings preventing it from being called as a kubectl command.
kubectl plugin list
The following kubectl-compatible plugins are available:
test/fixtures/pkg/kubectl/plugins/kubectl-foo
/usr/local/bin/kubectl-foo
- warning: /usr/local/bin/kubectl-foo is overshadowed by a similarly named plugin: test/fixtures/pkg/kubectl/plugins/kubectl-foo
plugins/kubectl-invalid
- warning: plugins/kubectl-invalid identified as a kubectl plugin, but it is not executable
error: 2 plugin warnings were found
Using the command line runtime package
If you're writing a plugin for kubectl and you choose Go, you can make use of the
cli-runtime utility libraries.
These libraries provide helpers for parsing or updating a user's
kubeconfig
file, for making REST-style requests to the API server, and for binding flags to
a configuration and printing it.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
A Kubernetes Pod is a group of one or more containers,
tied together for the purposes of administration and networking. The Pod in this tutorial has only one container.
A Kubernetes Deployment
checks on the health of your Pod and restarts the Pod's container if it terminates.
Deployments are the recommended way to manage the creation and scaling of Pods.
Use the kubectl create command to create a Deployment that manages a Pod. The
Pod runs a container based on the provided Docker image.
With modern web services, users expect applications to be available 24/7, and developers expect to deploy new versions of those applications several times a day. Containerization helps package software to serve these goals, enabling applications to be released and updated in an easy and fast way without downtime. Kubernetes helps you make sure those containerized applications run where and when you want, and helps them find the resources and tools they need to work. Kubernetes is a production-ready, open source platform designed with Google's accumulated experience in container orchestration, combined with best-of-breed ideas from the community.
A Pod models an application-specific "logical host" and can contain different application containers which are relatively tightly coupled. For example, a Pod might include both the container with your Node.js app as well as a different container that feeds the data to be published by the Node.js webserver. The containers in a Pod share an IP address and port space, are always co-located and co-scheduled, and run in a shared context on the same Node.
Pods are the atomic unit on the Kubernetes platform. When we create a Deployment on Kubernetes, that Deployment creates Pods with containers inside them (as opposed to creating containers directly). Each Pod is tied to the Node where it is scheduled, and remains there until termination (according to restart policy) or deletion. In case of a Node failure, identical Pods are scheduled on other available Nodes in the cluster.
Summary:
Pods
Nodes
Kubectl main commands
A Pod is a group of one or more application containers (such as Docker) and includes shared storage (volumes), an IP address, and information about how to run them.
Pods overview
Nodes
A Pod always runs on a Node. A Node is a worker machine in Kubernetes and may be either a virtual or a physical machine, depending on the cluster. Each Node is managed by the Master. A Node can have multiple Pods, and the Kubernetes Master automatically handles scheduling the Pods across the Nodes in the cluster. The Master's automatic scheduling takes into account the available resources on each Node.
Every Kubernetes Node runs at least:
Kubelet, a process responsible for communication between the Kubernetes Master and the Node; it manages the Pods and the containers running on a machine.
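The scheduling decision described above can be illustrated with a toy filter: place the Pod on a Node that has enough free CPU and memory. This is a didactic sketch only; the real kube-scheduler filters and scores Nodes on many more criteria (taints, affinity, topology, and so on):

```python
def pick_node(nodes, pod_cpu, pod_mem):
    """Return the first node with enough free CPU (cores) and memory (MiB)
    for the Pod, or None if the Pod must stay Pending."""
    for name, free in nodes.items():
        if free["cpu"] >= pod_cpu and free["mem"] >= pod_mem:
            return name
    return None  # Pod stays Pending until resources free up

# Hypothetical free capacity per node:
nodes = {"node-1": {"cpu": 0.2, "mem": 256},
         "node-2": {"cpu": 1.5, "mem": 2048}}
```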
Learn how Label and Label Selector objects relate to a Service
Expose an application outside a Kubernetes cluster using a Service
Overview of Kubernetes Services
Kubernetes Pods are mortal. Pods in fact have a lifecycle. When a worker node dies, the Pods running on the Node are also lost. A ReplicaSet might then dynamically drive the cluster back to the desired state via the creation of new Pods to keep your application running. As another example, consider an image-processing backend with 3 replicas. Those replicas are exchangeable; the frontend system should not care about backend replicas or even if a Pod is lost and recreated. That said, each Pod in a Kubernetes cluster has a unique IP address, even Pods on the same Node, so there needs to be a way of automatically reconciling changes among Pods so that your applications continue to function.
A Service in Kubernetes is an abstraction which defines a logical set of Pods and a policy by which to access them. Services enable loose coupling between dependent Pods. A Service is defined using YAML (preferred) or JSON, like all Kubernetes objects. The set of Pods targeted by a Service is usually determined by a LabelSelector (see below for why you might want a Service without including a selector in the spec).
Although each Pod has a unique IP address, those IPs are not exposed outside the cluster without a Service. Services allow your applications to receive traffic. Services can be exposed in different ways by specifying a type in the ServiceSpec:
ClusterIP (default) - Exposes the Service on an internal IP in the cluster. This type makes the Service only reachable from within the cluster.
Use the external IP address (LoadBalancer Ingress) to access the Hello World application:
curl http://<external-ip>:<port>
where <external-ip> is the external IP address (LoadBalancer Ingress) of your Service,
and <port> is the value of port in your Service description.
If you are using minikube, typing minikube service my-service automatically opens the Hello World application in a browser.
The response to a successful request is a hello message:
Hello Kubernetes!
Cleaning up
To delete the Service, enter this command:
kubectl delete services my-service
To delete the Deployment, the ReplicaSet, and the Pods that are running the Hello World application, enter this command:
# SOURCE: https://cloud.google.com/kubernetes-engine/docs/tutorials/guestbook
apiVersion: v1
kind: Service
metadata:
  name: redis-follower
  labels:
    app: redis
    role: follower
    tier: backend
spec:
  ports:
  # the port that this service should serve on
  - port: 6379
  selector:
    app: redis
    role: follower
    tier: backend
# SOURCE: https://cloud.google.com/kubernetes-engine/docs/tutorials/guestbook
apiVersion: v1
kind: Service
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # if your cluster supports it, uncomment the following to automatically create
  # an external load-balanced IP for the frontend service.
  # type: LoadBalancer
  ports:
  # the port that this service should serve on
  - port: 80
  selector:
    app: guestbook
    tier: frontend
The kustomization.yaml contains all the resources for deploying a WordPress site as well as a MySQL database. You can apply the directory by:
kubectl apply -k ./
Now you can verify that all objects exist.
Verify that the Secret exists by running the following command:
kubectl get secrets
The response should be like this:
NAME TYPE DATA AGE
mysql-pass-c57bb4t7mf Opaque 1 9s
Verify that a PersistentVolume got dynamically provisioned:
kubectl get pvc
Note: It can take up to a few minutes for the PVs to be provisioned and bound.
The response should be like this:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
mysql-pv-claim Bound pvc-8cbd7b2e-4044-11e9-b2bb-42010a800002 20Gi RWO standard 77s
wp-pv-claim Bound pvc-8cd0df54-4044-11e9-b2bb-42010a800002 20Gi RWO standard 77s
Verify that the Pod is running by running the following command:
kubectl get pods
Note: It can take up to a few minutes for the Pod's status to be RUNNING.
The response should be like this:
NAME READY STATUS RESTARTS AGE
wordpress-mysql-1894417608-x5dzt 1/1 Running 0 40s
Verify that the Service is running by running the following command:
kubectl get services wordpress
The response should be like this:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
wordpress ClusterIP 10.0.0.89 <pending> 80:32406/TCP 4m
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
        imagePullPolicy: Always
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
        resources:
          limits:
            cpu: "500m"
            memory: 1Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        securityContext:
          capabilities:
            add:
              - IPC_LOCK
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - nodetool drain
        env:
          - name: MAX_HEAP_SIZE
            value: 512M
          - name: HEAP_NEWSIZE
            value: 100M
          - name: CASSANDRA_SEEDS
            value: "cassandra-0.cassandra.default.svc.cluster.local"
          - name: CASSANDRA_CLUSTER_NAME
            value: "K8Demo"
          - name: CASSANDRA_DC
            value: "DC1-K8Demo"
          - name: CASSANDRA_RACK
            value: "Rack1-K8Demo"
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5
        # These volume mounts are persistent. They are like inline claims,
        # but not exactly because the names need to match exactly one of
        # the stateful pod volumes.
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  # These are converted to volume claims by the controller
  # and mounted at the paths mentioned above.
  # do not use these in production until ssd GCEPersistentDisk or other ssd pd
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast
      resources:
        requests:
          storage: 1Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: k8s.io/minikube-hostpath
parameters:
  type: pd-ssd
```yaml
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: 2016-08-13T18:40:58Z
  generation: 1
  labels:
    app: cassandra
  name: cassandra
  namespace: default
  resourceVersion: "323"
  uid: 7a219483-6185-11e6-a910-42010a8a0fc0
spec:
  replicas: 3
```
2. Change the number of replicas to 4, and then save the manifest.
   The StatefulSet now scales to run with 4 Pods.
3. Get the Cassandra StatefulSet to verify your change:
```shell
kubectl get statefulset cassandra
```
The response should be similar to:
```
NAME DESIRED CURRENT AGE
cassandra 4 4 36m
```
2016-12-06 19:34:16,236 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52740
2016-12-06 19:34:16,237 [myid:1] - INFO [Thread-1136:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52740 (no session established for client)
2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52749
2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52749
2016-12-06 19:34:26,156 [myid:1] - INFO [Thread-1137:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52749 (no session established for client)
2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52750
2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52750
2016-12-06 19:34:26,226 [myid:1] - INFO [Thread-1138:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52750 (no session established for client)
2016-12-06 19:34:36,151 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52760
2016-12-06 19:34:36,152 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52760
2016-12-06 19:34:36,152 [myid:1] - INFO [Thread-1139:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52760 (no session established for client)
2016-12-06 19:34:36,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52761
2016-12-06 19:34:36,231 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52761
2016-12-06 19:34:36,231 [myid:1] - INFO [Thread-1140:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52761 (no session established for client)
2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52767
2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52767
2016-12-06 19:34:46,149 [myid:1] - INFO [Thread-1141:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52767 (no session established for client)
2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52768
2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52768
2016-12-06 19:34:46,230 [myid:1] - INFO [Thread-1142:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52768 (no session established for client)
waiting for statefulset rolling update to complete 0 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 1 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 2 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
statefulset rolling update complete 3 pods at revision zk-5db4499664...
This terminates the Pods one at a time, in reverse ordinal order, and recreates each one with the new configuration.
This ensures that quorum is maintained during a rolling update.
kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
node "kubernetes-node-i4c4" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-i4c4, kube-proxy-kubernetes-node-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
WARNING: Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog; Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-i4c4, kube-proxy-kubernetes-node-i4c4
There are pending pods when an error occurred: Cannot evict pod as it would violate the pod's disruption budget.
pod/zk-2
The NGINX web server serves an index file from /usr/share/nginx/html/index.html by default.
The volumeMounts field in the StatefulSet's spec ensures that the /usr/share/nginx/html directory is backed by a PersistentVolume.
Write the Pods' hostnames to their index.html files and verify that the NGINX web servers serve the hostnames:
for i in 0 1; do kubectl exec "web-$i" -- sh -c 'echo "$(hostname)" > /usr/share/nginx/html/index.html'; done
for i in 0 1; do kubectl exec -i -t "web-$i" -- curl http://localhost/; done
AppArmor is a Linux kernel security module that supplements the standard Linux user- and group-based permissions to confine programs to a limited set of resources.
AppArmor can be configured for any application to reduce its potential attack surface and provide greater defense in depth.
AppArmor is configured through profiles tuned to allow the access needed by a specific program or container, such as Linux capabilities, network access, file permissions, and so on.
Each profile can be run in either *enforcing* mode, which blocks access to disallowed resources, or *complain* mode, which only reports violations.
Kubernetes enforces AppArmor by first checking that all the prerequisites have been met, and then forwarding the profile selection to the container runtime for enforcement. If the prerequisites have not been met, the Pod is rejected and does not run.
To verify that the profile was applied, you can look for the AppArmor security option listed in the container created event:
kubectl get events | grep Created
22s 22s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet e2e-test-stclair-node-pool-31nt} Created container with docker id 269a53b202d3; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor
  annotations:
    # Tell Kubernetes to apply the AppArmor profile "k8s-apparmor-example-deny-write".
    # Note that this is ignored if the Kubernetes node is not running version 1.4 or greater.
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
  containers:
  - name: hello
    image: busybox
    command: ["sh", "-c", "echo 'Hello AppArmor!' && sleep 1h"]
```
kubectl create -f ./hello-apparmor.yaml
If we look at the Pod events, we can see that the Pod container was created with the AppArmor profile "k8s-apparmor-example-deny-write":
kubectl get events | grep hello-apparmor
14s 14s 1 hello-apparmor Pod Normal Scheduled {default-scheduler } Successfully assigned hello-apparmor to gke-test-default-pool-239f5d02-gyn2
14s 14s 1 hello-apparmor Pod spec.containers{hello} Normal Pulling {kubelet gke-test-default-pool-239f5d02-gyn2} pulling image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Pulled {kubelet gke-test-default-pool-239f5d02-gyn2} Successfully pulled image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet gke-test-default-pool-239f5d02-gyn2} Created container with docker id 06b6cd1c0989; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Started {kubelet gke-test-default-pool-239f5d02-gyn2} Started container with docker id 06b6cd1c0989
The scheduler is not aware of which profiles are loaded onto which node, so the full set of profiles must be loaded onto every node. An alternative approach is to add a node label for each profile (or class of profiles) on the node, and use a [node selector](/zh/docs/concepts/configuration/assign-pod-node/) to ensure the Pod is run on a node with the required profile.
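As a sketch of that node-label approach (the label key and value below are purely illustrative, not a Kubernetes convention), a Pod could pin itself to nodes that advertise the required profile:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor-pinned    # hypothetical Pod name
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
  # Assumes an operator has labeled the nodes that have the profile loaded, e.g.:
  #   kubectl label node <node> apparmor/k8s-apparmor-example-deny-write=loaded
  nodeSelector:
    apparmor/k8s-apparmor-example-deny-write: "loaded"
  containers:
  - name: hello
    image: busybox
    command: ["sh", "-c", "echo 'Hello AppArmor!' && sleep 1h"]
```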
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6a96207fed4b kindest/node:v1.18.2 "/usr/local/bin/entr…" 27 seconds ago Up 24 seconds 127.0.0.1:42223->6443/tcp kind-control-plane
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: audit-pod
  labels:
    app: audit-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/audit.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: audit-pod
  labels:
    app: audit-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: localhost/profiles/audit.json
spec:
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
```
Create the Pod in the cluster:
kubectl apply -f audit-pod.yaml
This profile does not restrict any syscalls, so the Pod should start successfully.
kubectl get pod/audit-pod
NAME READY STATUS RESTARTS AGE
audit-pod 1/1 Running 0 30s
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: violation-pod
  labels:
    app: violation-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/violation.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: violation-pod
  labels:
    app: violation-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: localhost/profiles/violation.json
spec:
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
```
Create the Pod in the cluster:
kubectl apply -f violation-pod.yaml
If you check the status of the Pod, you will see that it failed to start.
kubectl get pod/violation-pod
NAME READY STATUS RESTARTS AGE
violation-pod 0/1 CrashLoopBackOff 1 6s
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fine-pod
  labels:
    app: fine-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/fine-grained.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fine-pod
  labels:
    app: fine-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: localhost/profiles/fine-grained.json
spec:
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
```
Create the Pod in your cluster:
kubectl apply -f fine-pod.yaml
The Pod should start successfully.
kubectl get pod/fine-pod
NAME READY STATUS RESTARTS AGE
fine-pod 1/1 Running 0 30s
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: audit-pod
  labels:
    app: audit-pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: default-pod
  labels:
    app: default-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
spec:
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false
```
NAME STATUS ROLES AGE VERSION
kubernetes-node-6jst Ready <none> 2h v1.13.0
kubernetes-node-cx31 Ready <none> 2h v1.13.0
kubernetes-node-jj1t Ready <none> 2h v1.13.0
GET /api/v1/pods?limit=500&continue=ENCODED_CONTINUE_TOKEN_2
---
200 OK
Content-Type: application/json
{
"kind": "PodList",
"apiVersion": "v1",
"metadata": {
"resourceVersion":"10245",
"continue": "", // continue token is empty because we have reached the end of the list
...
},
"items": [...] // returns pods 1001-1253
}
Note that the resourceVersion of the list remains constant across each request,
indicating the server is showing us a consistent snapshot of the Pods.
Pods that are created, updated, or deleted after version 10245 would not be shown
unless the user makes a list request without the continue token.
This design allows clients to break large requests into smaller chunks and then
perform a watch operation on the full set without missing any updates.
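The chunking loop a client performs can be sketched as follows. This is a toy stand-in, not the real Kubernetes client code: `fake_list` merely simulates `GET /api/v1/pods?limit=...&continue=...`, and all names here are illustrative.

```python
# Simulate a cluster with 1253 Pods, matching the example response above.
PODS = [f"pod-{i}" for i in range(1253)]

def fake_list(limit, continue_token=""):
    """Simulate one paginated LIST response from the API server."""
    start = int(continue_token) if continue_token else 0
    end = min(start + limit, len(PODS))
    next_token = str(end) if end < len(PODS) else ""  # empty token => done
    return {
        # resourceVersion stays constant across chunks: a consistent snapshot
        "metadata": {"resourceVersion": "10245", "continue": next_token},
        "items": PODS[start:end],
    }

def list_all(limit=500):
    """Issue LIST requests until the server returns an empty continue token."""
    items, token = [], ""
    while True:
        page = fake_list(limit, token)
        items.extend(page["items"])
        token = page["metadata"]["continue"]
        if not token:
            return items

print(len(list_all()))  # 1253 Pods collected across 3 chunks of <= 500
```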
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-cm
  namespace: default
  labels:
    test-label: test
  managedFields:
  - manager: kubectl
    operation: Apply
    apiVersion: v1
    time: "2010-10-10T0:00:00Z"
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:test-label: {}
      f:data:
        f:key: {}
data:
  key: some value
```

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-cm
  namespace: default
  labels:
    test-label: test
  managedFields:
  - manager: kubectl
    operation: Apply
    apiVersion: v1
    fields:
      f:metadata:
        f:labels:
          f:test-label: {}
  - manager: kube-controller-manager
    operation: Update
    apiVersion: v1
    time: '2019-03-30T16:00:00.000Z'
    fields:
      f:data:
        f:key: {}
data:
  key: new value
```
Since Kubernetes is an API-driven system, the API has evolved over time to reflect
the evolving understanding of the problem space. The Kubernetes API is actually a
set of APIs, called "API groups", and each API group is independently versioned.
API versions fall into three tracks, each of which has a different deprecation policy:

| Example  | Track                            |
| -------- | -------------------------------- |
| v1       | GA (generally available, stable) |
| v1beta1  | Beta (pre-release)               |
| v1alpha1 | Alpha (experimental)             |

A given release of Kubernetes can support any number of API groups, and each group
can have any number of versions.
The following rules govern the deprecation of elements of the API. This includes:

- REST resources (aka API objects)
- Fields of REST resources
- Annotations on REST resources, including "beta" annotations but not including "alpha" annotations
- Enumerated or constant values
- Component config structures

These rules are enforced between official releases; they do not apply to changes between
arbitrary commits on master or release branches.

Rule #1: API elements may only be removed by incrementing the version of the API group.

Once an API element has been added to an API group at a particular version, it can not be
removed from that version or have its behavior significantly changed, regardless of track
(GA, Alpha, or Beta).
Note: For historical reasons, there are two "monolithic" API groups in Kubernetes -
"core" (with no group name) and "extensions". Resources in these legacy API groups
are being incrementally migrated to more domain-specific API groups.
Rule #2: API objects must be able to round-trip between API versions in a given release
without information loss, with the exception of whole REST resources that do not exist
in some versions.
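As a toy illustration of what "round-trippable" means (a simplified sketch, not Kubernetes' actual conversion machinery): a field that exists only in the newer version is stashed in an annotation when converting down, so converting back up restores it without information loss. The annotation key and field name here are made up.

```python
ANN = "example.com/new-field"  # hypothetical annotation key

def v1_to_v1beta1(obj):
    """Convert down: move newField into an annotation v1beta1 can carry."""
    out = {k: v for k, v in obj.items() if k not in ("newField", "annotations")}
    ann = dict(obj.get("annotations", {}))
    if "newField" in obj:
        ann[ANN] = obj["newField"]
    if ann:
        out["annotations"] = ann
    return out

def v1beta1_to_v1(obj):
    """Convert up: restore newField from the annotation."""
    out = {k: v for k, v in obj.items() if k != "annotations"}
    ann = dict(obj.get("annotations", {}))
    if ANN in ann:
        out["newField"] = ann.pop(ANN)
    if ann:  # keep any unrelated annotations
        out["annotations"] = ann
    return out

original = {"name": "demo", "newField": "value"}
assert v1beta1_to_v1(v1_to_v1beta1(original)) == original  # lossless round-trip
```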
No policy can cover every possible situation. This policy is a living document and will
evolve over time. In practice, there will be special cases that do not fit neatly into
this policy, or for which this policy becomes a serious impediment. Such situations
should be discussed with the relevant SIGs and project leads to find the best solutions
for those specific cases, always bearing in mind that Kubernetes is committed to being
a stable system that, as much as possible, never breaks its users. Any exceptions to
this deprecation policy will be announced in all relevant release notes.
6.2.5 - Kubernetes API Health Endpoints
The Kubernetes API server provides API endpoints to indicate the current status of the API server.
This page describes these API endpoints and explains how you can use them.
API endpoints for health
The Kubernetes API server provides 3 API endpoints (healthz, livez and readyz) to indicate the current status of the API server.
The healthz endpoint is deprecated (since Kubernetes v1.16), and you should use the more specific livez and readyz endpoints instead.
The livez endpoint can be used with the --livez-grace-period flag to specify the startup duration.
For a graceful shutdown, you can use the /readyz endpoint together with the --shutdown-delay-duration flag.
Machines that check the healthz/livez/readyz endpoints of the API server should rely on the HTTP status code.
A status code 200 indicates the API server is healthy, live, or ready, depending on the called endpoint.
The more verbose options below are intended for human operators to debug their cluster or the state of a specific API server.
The following examples show how you can interact with the health API endpoints.
For all endpoints, you can use the verbose parameter to print out the checks and their status.
This can be useful for a human operator to debug the current status of the API server; it is not intended to be consumed by a machine:
curl -k https://localhost:6443/livez?verbose
Or from a remote host with authentication:
kubectl get --raw='/readyz?verbose'
The output will look like this:
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed
The Kubernetes API server also supports excluding specific checks.
The query parameters can also be combined, as in this example:
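For instance, the output below (with the etcd check excluded) can be produced by combining the verbose and exclude query parameters; a sketch assuming local access to the API server on port 6443:

```shell
# Combine verbose output with exclusion of the etcd check
curl -k 'https://localhost:6443/readyz?verbose&exclude=etcd'
```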
[+]ping ok
[+]log ok
[+]etcd excluded: ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]shutdown ok
healthz check passed
Kubernetes can use this information in several ways.
For example, the scheduler automatically tries to spread the Pods in a ReplicaSet across nodes in a single-zone cluster (to reduce the impact of node failures; see kubernetes.io/hostname).
With multiple-zone clusters, this spreading behavior also applies to zones (to reduce the impact of zone failures).
This is achieved via SelectorSpreadPriority.
SelectorSpreadPriority is a best-effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different Pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.
The scheduler (through the VolumeZonePredicate predicate) also ensures that Pods that claim a given volume are only placed into the same zone as that volume.
Volumes cannot be attached across zones.
If PersistentVolumeLabel does not support automatic labeling of your PersistentVolumes, you should consider adding the labels manually (or adding support for PersistentVolumeLabel).
With PersistentVolumeLabel, the scheduler prevents Pods from mounting volumes in a different zone.
If your infrastructure doesn't have this constraint, you don't need to add the zone labels to the volumes at all.
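Manually labeling a volume might look like the following sketch (the PV name, zone value, and hostPath backing are placeholders for illustration):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv                        # placeholder name
  labels:
    topology.kubernetes.io/zone: us-east-1a   # placeholder zone
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /data/example-pv                # illustrative backing store only
```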
node.kubernetes.io/windows-build
Example: node.kubernetes.io/windows-build=10.0.17763
Used on: Node
When the kubelet is running on Microsoft Windows, it automatically labels its node to record the version of Windows Server in use.
--service-account-key-file A file containing a PEM-encoded key for signing bearer tokens.
If unspecified, the API server's TLS private key will be used.
--service-account-lookup If enabled, tokens which are deleted from the API will be revoked.
Service accounts are usually created automatically by the API server and associated with
Pods running in the cluster through the ServiceAccount admission controller.
Bearer tokens are mounted into Pods at well-known locations, and allow in-cluster processes to talk to the API server.
Accounts may also be explicitly associated with Pods using the serviceAccountName field of the Pod spec.
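An explicit association might look like the following sketch (the Pod and account names are placeholders; the service account must already exist in the namespace):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sa-demo                     # placeholder name
spec:
  serviceAccountName: build-robot   # placeholder; must already exist
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
```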
A note about requirement #3 above, requiring a CA-signed certificate.
If you deploy your own identity provider (as opposed to one of the cloud providers like
Google or Microsoft), you MUST have your identity provider's web server certificate
signed by a certificate with the CA flag set to TRUE, even if it is self-signed.
This is due to GoLang's TLS client implementation being very strict about the standards
around certificate validation. If you don't have a CA certificate handy, you can use
this script from the CoreOS team
to create a simple CA and a signed certificate and key pair.
Or you can use
this similar script,
which generates SHA256 certs with a longer life and larger key size.
kubectl config set-credentials USER_NAME \
--auth-provider=oidc \
--auth-provider-arg=idp-issuer-url=( issuer url )\
--auth-provider-arg=client-id=( your client id )\
--auth-provider-arg=client-secret=( your client secret )\
--auth-provider-arg=refresh-token=( your refresh token )\
--auth-provider-arg=idp-certificate-authority=( path to your ca certificate )\
--auth-provider-arg=id-token=( your id_token )
kubectl --token=eyJhbGciOiJSUzI1NiJ9.eyJpc3MiOiJodHRwczovL21sYi50cmVtb2xvLmxhbjo4MDQzL2F1dGgvaWRwL29pZGMiLCJhdWQiOiJrdWJlcm5ldGVzIiwiZXhwIjoxNDc0NTk2NjY5LCJqdGkiOiI2RDUzNXoxUEpFNjJOR3QxaWVyYm9RIiwiaWF0IjoxNDc0NTk2MzY5LCJuYmYiOjE0NzQ1OTYyNDksInN1YiI6Im13aW5kdSIsInVzZXJfcm9sZSI6WyJ1c2VycyIsIm5ldy1uYW1lc3BhY2Utdmlld2VyIl0sImVtYWlsIjoibXdpbmR1QG5vbW9yZWplZGkuY29tIn0.f2As579n9VNoaKzoF-dOQGmXkFKf1FMyNV0-va_B63jn-_n9LGSCca_6IVMP8pO-Zb4KvRqGyTP0r3HkHxYy5c81AnIh8ijarruczl-TK_yF5akjSTHFZD-0gRzlevBDiH8Q79NAr-ky0P4iIXS8lY9Vnjch5MF74Zx0c3alKJHJUnnpjIACByfF2SCaYzbWFMUNat-K1PaUk5-ujMBG7yYnr95xD-63n8CO8teGUAAEMx6zRjzfhnhbzX-ajwZLGwGUBT4WqjMs70-6a7_8gZmLZb2az1cZynkFRj2BaCkVT3A2RrjeEwZEtGXlMqKJ1_I2ulrOVsYx01_yD35-rw get nodes
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csr-approver
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests/approval
  verbs:
  - update
- apiGroups:
  - certificates.k8s.io
  resources:
  - signers
  resourceNames:
  - example.com/my-signer-name # example.com/* can be used to authorize for all signers in the 'example.com' domain
  verbs:
  - approve
```

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csr-signer
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests/status
  verbs:
  - update
- apiGroups:
  - certificates.k8s.io
  resources:
  - signers
  resourceNames:
  - example.com/my-signer-name # example.com/* can be used to authorize for all signers in the 'example.com' domain
  verbs:
  - sign
```

```yaml
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
...
status:
  conditions:
  - lastUpdateTime: "2020-02-08T11:37:35Z"
    lastTransitionTime: "2020-02-08T11:37:35Z"
    message: Approved by my custom approver controller
    reason: ApprovedByMyPolicy # You can set this to any string
    type: Approved
```
A denied CSR:
```yaml
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
...
status:
  conditions:
  - lastUpdateTime: "2020-02-08T11:37:35Z"
    lastTransitionTime: "2020-02-08T11:37:35Z"
    message: Denied by my custom approver controller
    reason: DeniedByMyPolicy # You can set this to any string
    type: Denied
```
This admission controller modifies every new Pod to force the image pull policy to Always.
This is useful in a multitenant cluster so that users can be assured that their private images can only be used by those who have the credentials to pull them.
Without this admission controller, once an image has been pulled to a node, any Pod from any user can use it by knowing the image's name (assuming the Pod is scheduled onto the right node), without any authorization check against the image.
When this admission controller is enabled, images are always pulled prior to starting containers, which means valid credentials are required.
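As a sketch, admission controllers such as this one are enabled through an API server flag; the exact plugin list varies by cluster and the list below is only illustrative:

```shell
# Enable AlwaysPullImages (alongside whatever other plugins the cluster needs)
kube-apiserver --enable-admission-plugins=NamespaceLifecycle,AlwaysPullImages
```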
```yaml
# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: EventRateLimit
  path: eventconfig.yaml
...
```
The PodTolerationRestriction admission controller verifies any conflict between the
tolerations of a Pod and the tolerations of its namespace, and rejects the Pod request
if there is a conflict.
It then merges the namespace's tolerations into the Pod's tolerations, and checks the
resulting tolerations against the namespace's allowlist of tolerations.
If the check succeeds, the Pod request is admitted; otherwise it is rejected.
If the Pod's namespace does not have any associated default tolerations or allowlist
of tolerations, the cluster-level default tolerations or cluster-level allowlist are
used instead, if specified.
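Default tolerations and the allowlist are read from namespace annotations; the sketch below shows the shape this takes (the namespace name and toleration values are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: apps-that-need-nodes-exclusively   # placeholder name
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated-node"}]'
    scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated-node"}]'
```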
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: "pod-policy.example.com"
webhooks:
- name: "pod-policy.example.com"
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
    scope: "Namespaced"
  clientConfig:
    service:
      namespace: "example-namespace"
      name: "example-service"
    caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle containing the CA that signed the webhook's serving certificate>...tLS0K"
  admissionReviewVersions: ["v1", "v1beta1"]
  sideEffects: None
  timeoutSeconds: 5
```

```yaml
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: "pod-policy.example.com"
webhooks:
- name: "pod-policy.example.com"
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
    scope: "Namespaced"
  clientConfig:
    service:
      namespace: "example-namespace"
      name: "example-service"
    caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle containing the CA that signed the webhook's serving certificate>...tLS0K"
  admissionReviewVersions: ["v1beta1"]
  timeoutSeconds: 5
```
{
"apiVersion": "admission.k8s.io/v1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": false,
"status": {
"code": 403,
"message": "You cannot do this because it is Tuesday and your name starts with A"
}
}
}
{
"apiVersion": "admission.k8s.io/v1beta1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": false,
"status": {
"code": 403,
"message": "You cannot do this because it is Tuesday and your name starts with A"
}
}
}
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
  clientConfig:
    caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle containing the CA that signed the webhook's serving certificate>...tLS0K"
    service:
      namespace: my-service-namespace
      name: my-service-name
      path: /my-path
      port: 1234
  ...
```

```yaml
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
  clientConfig:
    caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle containing the CA that signed the webhook's serving certificate>...tLS0K"
    service:
      namespace: my-service-namespace
      name: my-service-name
      path: /my-path
      port: 1234
  ...
```
# HELP apiserver_admission_webhook_rejection_count [ALPHA] Admission webhook rejection count, identified by name and broken out for each admission type (validating or admit) and operation. Additional labels specify an error type (calling_webhook_error or apiserver_internal_error if an error occurred; no_error otherwise) and optionally a non-zero rejection code if the webhook rejects the request with an HTTP status code (honored by the apiserver when the code is greater or equal to 400). Codes greater than 600 are truncated to 600, to keep the metrics cardinality bounded.
# TYPE apiserver_admission_webhook_rejection_count counter
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="always-timeout-webhook.example.com",operation="CREATE",rejection_code="0",type="validating"} 1
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="invalid-admission-response-webhook.example.com",operation="CREATE",rejection_code="0",type="validating"} 1
apiserver_admission_webhook_rejection_count{error_type="no_error",name="deny-unwanted-configmap-data.example.com",operation="CREATE",rejection_code="400",type="validating"} 13
In the Kubernetes API, most resources are represented and accessed using a string representation of their object name.
For example, a Pod uses "pods".
RBAC refers to resources using exactly the same name that appears in the URL for the relevant API endpoint.
Some Kubernetes APIs involve a subresource, such as the logs for a Pod.
A request for a Pod's logs looks like:
GET /api/v1/namespaces/{namespace}/pods/{name}/log
In this case, pods is the namespaced resource for Pod resources, and log is a subresource of pods.
To represent this in an RBAC role, use a slash (/) to delimit the resource and subresource.
To allow a subject to read pods and also access the log subresource for each of those Pods, you write:
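A Role granting that access might look like the following (the namespace and role name are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default              # placeholder namespace
  name: pod-and-pod-logs-reader   # placeholder name
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]
```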
{
"apiVersion": "authorization.k8s.io/v1beta1",
"kind": "SubjectAccessReview",
"status": {
"allowed": false,
"reason": "user does not have read access to the namespace"
}
}
{
"apiVersion": "authorization.k8s.io/v1beta1",
"kind": "SubjectAccessReview",
"status": {
"allowed": false,
"denied": true,
"reason": "user does not have read access to the namespace"
}
}
The Kubernetes API is the application that serves Kubernetes functionality through a RESTful interface and stores the state of the cluster.
Kubernetes resources and "records of intent" are all stored as API objects, and modified via RESTful calls to the API.
The API allows configuration to be managed declaratively.
Users can interact with the Kubernetes API directly, or via tools like kubectl.
The core Kubernetes API is flexible and can also be extended to support custom resources.
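As a sketch, such extension happens by registering a CustomResourceDefinition; all names below are illustrative:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com   # must be <plural>.<group>
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
```

After the CRD is created, objects of kind CronTab can be created and listed like any built-in resource.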
preflight Run pre-flight checks
certs Certificate generation
/ca Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components
/apiserver Generate the certificate for serving the Kubernetes API
/apiserver-kubelet-client Generate the certificate for the API server to connect to kubelet
/front-proxy-ca Generate the self-signed CA to provision identities for front proxy
/front-proxy-client Generate the certificate for the front proxy client
/etcd-ca Generate the self-signed CA to provision identities for etcd
/etcd-server Generate the certificate for serving etcd
/etcd-peer Generate the certificate for etcd nodes to communicate with each other
/etcd-healthcheck-client Generate the certificate for liveness probes to healthcheck etcd
/apiserver-etcd-client Generate the certificate the apiserver uses to access etcd
/sa Generate a private key for signing service account tokens along with its public key
kubeconfig Generate all kubeconfig files necessary to establish the control plane and the admin kubeconfig file
/admin Generate a kubeconfig file for the admin to use and for kubeadm itself
/kubelet Generate a kubeconfig file for the kubelet to use *only* for cluster bootstrapping purposes
/controller-manager Generate a kubeconfig file for the controller manager to use
/scheduler Generate a kubeconfig file for the scheduler to use
kubelet-start Write kubelet settings and (re)start the kubelet
control-plane Generate all static Pod manifest files necessary to establish the control plane
/apiserver Generates the kube-apiserver static Pod manifest
/controller-manager Generates the kube-controller-manager static Pod manifest
/scheduler Generates the kube-scheduler static Pod manifest
etcd Generate static Pod manifest file for local etcd
/local Generate the static Pod manifest file for a local, single-node local etcd instance
upload-config Upload the kubeadm and kubelet configuration to a ConfigMap
/kubeadm Upload the kubeadm ClusterConfiguration to a ConfigMap
/kubelet Upload the kubelet component config to a ConfigMap
upload-certs Upload certificates to kubeadm-certs
mark-control-plane Mark a node as a control-plane
bootstrap-token Generates bootstrap tokens used to join a node to a cluster
kubelet-finalize Updates settings relevant to the kubelet after TLS bootstrap
/experimental-cert-rotation Enable kubelet client certificate rotation
addon Install required addons for passing Conformance tests
/coredns Install the CoreDNS addon to a Kubernetes cluster
/kube-proxy Install the kube-proxy addon to a Kubernetes cluster
kubeadm init [flags]
Options
--apiserver-advertise-address string
The IP address the API server will advertise it's listening on. If not set, the default network interface will be used.
--apiserver-bind-port int32     Default: 6443
Port for the API server to bind to.
--apiserver-cert-extra-sans stringSlice
Optional extra Subject Alternative Names (SANs) to use for the API server serving certificate. Can be both IP addresses and DNS names.
preflight Run join pre-flight checks
control-plane-prepare Prepare the machine for serving a control plane
/download-certs [EXPERIMENTAL] Download certificates shared among control-plane nodes from the kubeadm-certs Secret
/certs Generate the certificates for the new control plane components
/kubeconfig Generate the kubeconfig for the new control plane components
/control-plane Generate the manifests for the new control plane components
kubelet-start Write kubelet settings, certificates and (re)start the kubelet
control-plane-join Join a machine as a control plane instance
/etcd Add a new local etcd member
/update-status Register the new control-plane node into the ClusterStatus maintained in the kubeadm-config ConfigMap
/mark-control-plane Mark a node as a control-plane
kubeadm join [api-server-endpoint] [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API server will advertise it's listening on. If not set, the default network interface will be used.
preflight Run reset pre-flight checks
update-cluster-status Remove this node from the ClusterStatus object.
remove-etcd-member Remove a local etcd member.
cleanup-node Run cleanup node.
sudo chmod -x /usr/local/bin/kubectl-foo # remove execute permission
kubectl plugin list
The following kubectl-compatible plugins are available:
/usr/local/bin/kubectl-hello
/usr/local/bin/kubectl-foo
- warning: /usr/local/bin/kubectl-foo identified as a plugin, but it is not executable
/usr/local/bin/kubectl-bar
error: one plugin warning was found
You can see plugins as a means to build more complex functionality on top of existing kubectl commands:
The next few examples assume that you already made kubectl-whoami have the following contents:
cat ./kubectl-whoami
#!/bin/bash
# This plugin makes use of the `kubectl config` command in order to output
# information about the current user, based on the currently selected context
kubectl config view --template='{{ range .contexts }}{{ if eq .name "'$(kubectl config current-context)'" }}Current user: {{ printf "%s\n" .context.user }}{{ end }}{{ end }}'
kubectl get pods -o json
kubectl get pods -o=jsonpath='{@}'
kubectl get pods -o=jsonpath='{.items[0]}'
kubectl get pods -o=jsonpath='{.items[0].metadata.name}'
kubectl get pods -o=jsonpath="{.items[*]['metadata.name', 'status.capacity']}"
kubectl get pods -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'
Note:
On Windows, you must double quote any JSONPath template that contains spaces (not single quote as shown above for bash).
This in turn means that you must use a single quote or escaped double quote around any literals in the template.
For example:
C:\> kubectl get pods -o=jsonpath="{range .items[*]}{.metadata.name}{'\t'}{.status.startTime}{'\n'}{end}"
C:\> kubectl get pods -o=jsonpath="{range .items[*]}{.metadata.name}{\"\t\"}{.status.startTime}{\"\n\"}{end}"
# All images running in a cluster
kubectl get pods -A -o=custom-columns='DATA:spec.containers[*].image'

# All images running in namespace: default, grouped by Pod
kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,IMAGE:.spec.containers[*].image"

# All images excluding "k8s.gcr.io/coredns:1.6.2"
kubectl get pods -A -o=custom-columns='DATA:spec.containers[?(@.image!="k8s.gcr.io/coredns:1.6.2")].image'

# All fields under metadata, regardless of Pod name
kubectl get pods -A -o=custom-columns='DATA:metadata.*'
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
55c103fa1296 nginx "nginx -g 'daemon of…" 9 seconds ago Up 9 seconds 0.0.0.0:80->80/tcp nginx-app
kubectl:
# start the pod running nginx
kubectl create deployment --image=nginx nginx-app
deployment.apps/nginx-app created
# add env to nginx-app
kubectl set env deployment/nginx-app DOMAIN=cluster
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
14636241935f ubuntu:16.04 "echo test" 5 seconds ago Exited (0) 5 seconds ago cocky_fermi
55c103fa1296 nginx "nginx -g 'daemon of…" About a minute ago Up About a minute 0.0.0.0:80->80/tcp nginx-app
With kubectl:
kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-app-8df569cb7-4gd89 1/1 Running 0 3m
ubuntu 0/1 Completed 0 20s
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
55c103fa1296 nginx "nginx -g 'daemon of…" 5 minutes ago Up 5 minutes 0.0.0.0:80->80/tcp nginx-app
docker attach 55c103fa1296
...
kubectl:
kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-app-5jyvm 1/1 Running 0 10m
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
55c103fa1296 nginx "nginx -g 'daemon of…" 6 minutes ago Up 6 minutes 0.0.0.0:80->80/tcp nginx-app
docker exec 55c103fa1296 cat /etc/hostname
55c103fa1296
With kubectl:
kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-app-5jyvm 1/1 Running 0 10m
It is time to mention the subtle difference between pods and containers; by default pods do not terminate if their processes exit. Instead the pods restart the process. This is similar to the docker run option --restart=always, with one major difference: in docker, the output for each invocation of the process is concatenated, but for Kubernetes each invocation is separate. To see the output from a previous run in Kubernetes, do this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a9ec34d98787 nginx "nginx -g 'daemon of" 22 hours ago Up 22 hours 0.0.0.0:80->80/tcp, 443/tcp nginx-app
docker stop a9ec34d98787
a9ec34d98787
docker rm a9ec34d98787
a9ec34d98787
With kubectl:
kubectl get deployment nginx-app
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
nginx-app 1 1 1 1 2m
kubectl get po -l app=nginx-app
NAME READY STATUS RESTARTS AGE
nginx-app-2883164633-aklf7 1/1 Running 0 2m
kubectl delete deployment nginx-app
deployment "nginx-app" deleted
kubectl get po -l app=nginx-app
# Return nothing
Note:
Note that we do not delete the pod directly. With kubectl we delete the Deployment that owns the pod. If we delete the pod directly, the Deployment will recreate it.
Client version: 1.7.0
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 0baf609
OS/Arch (client): linux/amd64
Server version: 1.7.0
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 0baf609
OS/Arch (server): linux/amd64
Kubernetes master is running at https://108.59.85.141
KubeDNS is running at https://108.59.85.141/api/v1/namespaces/kube-system/services/kube-dns/proxy
kubernetes-dashboard is running at https://108.59.85.141/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy
Grafana is running at https://108.59.85.141/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
Heapster is running at https://108.59.85.141/api/v1/namespaces/kube-system/services/monitoring-heapster/proxy
InfluxDB is running at https://108.59.85.141/api/v1/namespaces/kube-system/services/monitoring-influxdb/proxy
Comma-separated list of DNS server IP addresses. This value is used to provide DNS for
containers in Pods with "dnsPolicy=ClusterFirst". Note: all DNS servers appearing in the
list MUST serve the same set of records, otherwise name resolution within the cluster may
not work correctly. There is no guarantee as to which DNS server may be queried for name resolution.
Defaults to Webhook when --config is given.
Deprecated: should be set via the configuration file specified by --config.
(Learn more)
A set of key=value pairs that enable or disable built-in APIs. Supported options are: v1=true|false for the core API group, <group>/<version>=true|false for a specific API group and version (e.g. apps/v1=true), api/all=true|false controls all API versions, api/ga=true|false controls all v[0-9]+ API versions, api/beta=true|false controls all v[0-9]+beta[0-9]+ API versions, api/alpha=true|false controls all v[0-9]+alpha[0-9]+ API versions. api/legacy is deprecated and will be removed in a future version.
A pair of X509 certificate and private key file paths, optionally suffixed with a list of
domain patterns, each a fully qualified domain name (FQDN) possibly with a prefixed wildcard
segment. The domain patterns also allow IP addresses, but IPs should only be used if the
API server is visible to clients on that IP address. If no domain patterns are provided,
the names are extracted from the certificate. Non-wildcard matches trump over wildcard
matches; explicit domain patterns trump over extracted names. For multiple key/certificate
pairs, use the --tls-sni-cert-key flag multiple times.
Examples: example.crt,example.key or foo.crt,foo.key:*.foo.com,foo.com.
--unhealthy-zone-threshold float32     Default: 0.55
Fraction of nodes in a zone which must be not-ready (minimum 3 nodes) for that zone
to be treated as unhealthy.
--use-service-account-credentials
If true, use individual service account credentials for each controller.
-v, --v int
Number for the log level verbosity.
--version version[=true]
Print version information and quit.
--vmodule <comma-separated 'pattern=N' settings>
Comma-separated list of pattern=N settings for file-filtered logging.
--volume-host-allow-local-loopback     Default: true
If false, deny local loopback IPs in addition to any CIDR ranges given in
--volume-host-cidr-denylist.
--volume-host-cidr-denylist strings
A comma-separated list of CIDR ranges on which volume plugins may not be used.
6.9.5 - kube-proxy
Synopsis
The Kubernetes network proxy runs on each node. It reflects the services defined in the
Kubernetes API on each node and can do simple TCP, UDP, and SCTP stream forwarding or
round-robin TCP, UDP, and SCTP forwarding across a set of backends.
Service cluster IPs and ports are currently found through Docker-links-compatible
environment variables specifying the ports opened by the service proxy.
There is an optional addon that provides cluster DNS for these cluster IPs.
The user must create a service with the apiserver API in order to configure the proxy.
Cluster contains information to allow an exec plugin to communicate
with the kubernetes cluster being authenticated to.
To ensure that this struct contains everything someone would need to communicate
with a kubernetes cluster (just like they would via a kubeconfig), the fields
should shadow "k8s.io/client-go/tools/clientcmd/api/v1".Cluster, with the exception
of CertificateAuthority, since CA data will always be passed to the plugin as bytes.
Field
Description
server[Required] string
Server is the address of the kubernetes cluster (https://hostname:port).
tls-server-name string
TLSServerName is passed to the server for SNI and is used in the client to
check server certificates against. If ServerName is empty, the hostname
used to contact the server is used.
insecure-skip-tls-verify bool
InsecureSkipTLSVerify skips the validity check for the server's certificate.
This will make your HTTPS connections insecure.
certificate-authority-data []byte
CAData contains PEM-encoded certificate authority certificates.
If empty, system roots should be used.
proxy-url string
ProxyURL is the URL to the proxy to be used for all requests to this
cluster.
Config holds additional config data that is specific to the exec
plugin with regards to the cluster being authenticated to.
This data is sourced from the clientcmd Cluster object's
extensions[client.authentication.k8s.io/exec] field:
clusters:
- name: my-cluster
  cluster:
    ...
    extensions:
    - name: client.authentication.k8s.io/exec  # reserved extension name for per cluster exec config
      extension:
        audience: 06e3fbd18de8  # arbitrary config
In some environments, the user config may be exactly the same across many clusters
(i.e. call this exec plugin) minus some details that are specific to each cluster
such as the audience. This field allows the per cluster config to be directly
specified with the cluster info. Using this field to store secret data is not
recommended as one of the prime benefits of exec plugins is that no secrets need
to be stored directly in the kubeconfig.
Cluster contains information to allow an exec plugin to communicate with the
kubernetes cluster being authenticated to. Note that Cluster is non-nil only
when provideClusterInfo is set to true in the exec provider config (i.e.,
ExecConfig.ProvideClusterInfo).
ExecCredentialStatus holds credentials for the transport to use.
Token and ClientKeyData are sensitive fields. This data should only be
transmitted in-memory between client and exec plugin process. Exec plugin
itself should at least be protected via file permissions.
The response status, populated even when the ResponseObject is not a Status type.
For successful responses, this will only include the Code and StatusSuccess.
For non-status type error responses, this will be auto-populated with the error Message.
API object from the request, in JSON format. The RequestObject is recorded as-is in the request
(possibly re-encoded as JSON), prior to version conversion, defaulting, admission or
merging. It is an external versioned object type, and may not be a valid object on its own.
Omitted for non-resource requests. Only logged at Request Level and higher.
API object returned in the response, in JSON. The ResponseObject is recorded after conversion
to the external type, and serialized as JSON. Omitted for non-resource requests. Only logged
at Response Level.
Annotations is an unstructured key value map stored with an audit event that may be set by
plugins invoked in the request serving chain, including authentication, authorization and
admission plugins. Note that these annotations are for the audit event, and do not correspond
to the metadata.annotations of the submitted object. Keys should uniquely identify the informing
component to avoid name collisions (e.g. podsecuritypolicy.admission.k8s.io/policy). Values
should be short. Annotations are included in the Metadata level.
Rules specify the audit Level a request should be recorded at.
A request may match multiple rules, in which case the FIRST matching rule is used.
The default audit level is None, but can be overridden by a catch-all rule at the end of the list.
PolicyRules are strictly ordered.
OmitStages is a list of stages for which no events are created. Note that this can also
be specified per rule in which case the union of both are omitted.
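As a sketch, the rule and stage fields described above combine in a Policy file like the following (the levels, resources, and omitted stage chosen here are illustrative, not recommendations):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Stages skipped policy-wide; any per-rule omitStages are unioned with this.
omitStages:
  - "RequestReceived"
rules:
  # First matching rule wins: record pod requests with full request/response bodies.
  - level: RequestResponse
    resources:
      - group: ""            # core API group
        resources: ["pods"]
  # Do not record health check probes.
  - level: None
    nonResourceURLs:
      - "/healthz*"
  # Catch-all rule at the end of the list, overriding the default level of None.
  - level: Metadata
```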
GroupResources represents resource kinds in an API group.
Field
Description
group string
Group is the name of the API group that contains the resources.
The empty string represents the core API group.
resources []string
Resources is a list of resources this rule applies to.
For example:
'pods' matches pods.
'pods/log' matches the log subresource of pods.
'*' matches all resources and their subresources.
'pods/*' matches all subresources of pods.
'*/scale' matches all scale subresources.
If wildcard is present, the validation rule will ensure resources do not
overlap with each other.
An empty list implies all resources and subresources in this API groups apply.
resourceNames []string
ResourceNames is a list of resource instance names that the policy matches.
Using this field requires Resources to be specified.
An empty list implies that every instance of the resource is matched.
The Level that requests matching this rule are recorded at.
users []string
The users (by authenticated user name) this rule applies to.
An empty list implies every user.
userGroups []string
The user groups this rule applies to. A user is considered matching
if it is a member of any of the UserGroups.
An empty list implies every user group.
verbs []string
The verbs that match this rule.
An empty list implies every verb.
Resources that this rule matches. An empty list implies all kinds in all API groups.
namespaces []string
Namespaces that this rule matches.
The empty string "" matches non-namespaced resources.
An empty list implies every namespace.
nonResourceURLs []string
NonResourceURLs is a set of URL paths that should be audited.
*s are allowed, but only as the full, final step in the path.
Examples:
"/metrics" - Log requests for apiserver metrics
"/healthz*" - Log all health checks
OmitStages is a list of stages for which no events are created. Note that this can also
be specified policy wide in which case the union of both are omitted.
An empty list means no restrictions will apply.
KubeProxyConfiguration contains everything necessary to configure the
Kubernetes proxy server.
Field
Description
apiVersion string
kubeproxy.config.k8s.io/v1alpha1
kind string
KubeProxyConfiguration
featureGates[Required] map[string]bool
featureGates is a map of feature names to bools that enable or disable alpha/experimental features.
bindAddress[Required] string
bindAddress is the IP address for the proxy server to serve on (set to 0.0.0.0
for all interfaces)
healthzBindAddress[Required] string
healthzBindAddress is the IP address and port for the health check server to serve on,
defaulting to 0.0.0.0:10256
metricsBindAddress[Required] string
metricsBindAddress is the IP address and port for the metrics server to serve on,
defaulting to 127.0.0.1:10249 (set to 0.0.0.0 for all interfaces)
bindAddressHardFail[Required] bool
bindAddressHardFail, if true, kube-proxy will treat failure to bind to a port as fatal and exit
enableProfiling[Required] bool
enableProfiling enables profiling via web interface on /debug/pprof handler.
Profiling handlers will be handled by metrics server.
clusterCIDR[Required] string
clusterCIDR is the CIDR range of the pods in the cluster. It is used to
bridge traffic coming from outside of the cluster. If not provided,
no off-cluster bridging will be performed.
hostnameOverride[Required] string
hostnameOverride, if non-empty, will be used as the identity instead of the actual hostname.
portRange is the range of host ports (beginPort-endPort, inclusive) that may be consumed
in order to proxy service traffic. If unspecified (0-0) then ports will be randomly chosen.
udpIdleTimeout is how long an idle UDP connection will be kept open (e.g. '250ms', '2s').
Must be greater than 0. Only applicable for proxyMode=userspace.
configSyncPeriod is how often configuration from the apiserver is refreshed. Must be greater
than 0.
nodePortAddresses[Required] []string
nodePortAddresses is the --nodeport-addresses value for kube-proxy process. Values must be valid
IP blocks. These values are used as a parameter to select the interfaces on which NodePorts
are served. If, for example, you would like to expose a service on localhost for local access
and on some other interfaces for a particular purpose, a list of IP blocks would do that.
If set it to "127.0.0.0/8", kube-proxy will only select the loopback interface for NodePort.
If set it to a non-zero IP block, kube-proxy will filter that down to just the IPs that applied to the node.
An empty string slice is meant to select all network interfaces.
tcpCloseWaitTimeout is how long an idle conntrack entry
in CLOSE_WAIT state will remain in the conntrack
table. (e.g. '60s'). Must be greater than 0 to set.
tcpFinTimeout is the timeout value used for IPVS TCP sessions after receiving a FIN.
The default value is 0, which preserves the current timeout value on the system.
ProxyMode represents modes used by the Kubernetes proxy server.
Currently, three modes of proxy are available in Linux platform: 'userspace' (older, going to be EOL), 'iptables'
(newer, faster), 'ipvs'(newest, better in performance and scalability).
Two modes of proxy are available in Windows platform: 'userspace'(older, stable) and 'kernelspace' (newer, faster).
On Linux, if the proxy mode is blank, the best-available proxy is used (currently iptables, but this may
change in the future). If the iptables proxy is selected (regardless of how) but the system's kernel or
iptables version is insufficient, this always falls back to the userspace proxy. IPVS mode is enabled when
the proxy mode is set to 'ipvs', and the fallback path is first iptables and then userspace.
On Windows, if the proxy mode is blank, the best-available proxy is used (currently userspace, but this may
change in the future). If the winkernel proxy is selected (regardless of how) but the Windows kernel cannot
support this mode of proxy, this always falls back to the userspace proxy.
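Tying these fields together, a minimal KubeProxyConfiguration file might look like this sketch; the pod CIDR here is a placeholder that must match your cluster:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
bindAddress: 0.0.0.0                 # serve on all interfaces
healthzBindAddress: 0.0.0.0:10256    # documented default
metricsBindAddress: 127.0.0.1:10249  # documented default
clusterCIDR: 10.244.0.0/16           # placeholder pod CIDR for this example
mode: "ipvs"                         # falls back to iptables, then userspace, if unsupported
```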
ClientConnectionConfiguration contains details for constructing a client.
Field
Description
kubeconfig[Required] string
kubeconfig is the path to a KubeConfig file.
acceptContentTypes[Required] string
acceptContentTypes defines the Accept header sent by clients when connecting to a server, overriding the
default value of 'application/json'. This field will control all connections to the server used by a particular
client.
contentType[Required] string
contentType is the content type used when sending data to the server from this client.
qps[Required] float32
qps controls the number of queries per second allowed for this connection.
burst[Required] int32
burst allows extra queries to accumulate when a client is exceeding its rate.
Holds the information to communicate with the extender(s)
hardPodAffinitySymmetricWeight[Required] int32
RequiredDuringScheduling affinity is not symmetric, but there is an implicit PreferredDuringScheduling affinity rule
corresponding to every RequiredDuringScheduling affinity rule.
HardPodAffinitySymmetricWeight represents the weight of implicit PreferredDuringScheduling affinity rule, in the range 1-100.
alwaysCheckAllPredicates[Required] bool
When AlwaysCheckAllPredicates is set to true, scheduler checks all
the configured predicates even after one or more of them fails.
When the flag is set to false, scheduler skips checking the rest
of the predicates after it finds one predicate that failed.
ExtenderTLSConfig contains settings to enable TLS with extender
Field
Description
insecure[Required] bool
Server should be accessed without verifying the TLS certificate. For testing only.
serverName[Required] string
ServerName is passed to the server for SNI and is used in the client to check server
certificates against. If ServerName is empty, the hostname used to contact the
server is used.
certFile[Required] string
Server requires TLS client certificate authentication
keyFile[Required] string
Server requires TLS client certificate authentication
caFile[Required] string
Trusted root certificates for server
certData[Required] []byte
CertData holds PEM-encoded bytes (typically read from a client certificate file).
CertData takes precedence over CertFile
keyData[Required] []byte
KeyData holds PEM-encoded bytes (typically read from a client certificate key file).
KeyData takes precedence over KeyFile
caData[Required] []byte
CAData holds PEM-encoded bytes (typically read from a root certificates bundle).
CAData takes precedence over CAFile
LabelPreference holds the parameters that are used to configure the corresponding priority function
Field
Description
label[Required] string
Used to identify node "groups"
presence[Required] bool
This is a boolean flag
If true, higher priority is given to nodes that have the label
If false, higher priority is given to nodes that do not have the label
LabelsPresence holds the parameters that are used to configure the corresponding predicate in scheduler policy configuration.
Field
Description
labels[Required] []string
The list of labels that identify node "groups"
All of the labels should be either present (or absent) for the node to be considered a fit for hosting the pod
presence[Required] bool
The boolean flag that indicates whether the labels should be present or absent from the node
LegacyExtender holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
it is assumed that the extender chose not to provide that extension.
Field
Description
urlPrefix[Required] string
URLPrefix at which the extender is available
filterVerb[Required] string
Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
preemptVerb[Required] string
Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to extender.
prioritizeVerb[Required] string
Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
weight[Required] int64
The numeric multiplier for the node scores that the prioritize call generates.
The weight should be a positive integer
bindVerb[Required] string
Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to extender.
If this method is implemented by the extender, it is the extender's responsibility to bind the pod to apiserver. Only one extender
can implement this function.
enableHttps[Required] bool
EnableHTTPS specifies whether https should be used to communicate with the extender
HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize
timeout is ignored, k8s/other extenders priorities are used to select the node.
nodeCacheCapable[Required] bool
NodeCacheCapable specifies that the extender is capable of caching node information,
so the scheduler should only send minimal information about the eligible nodes
assuming that the extender already cached full details of all nodes in the cluster
ManagedResources is a list of extended resources that are managed by
this extender.
- A pod will be sent to the extender on the Filter, Prioritize and Bind
(if the extender is the binder) phases iff the pod requests at least
one of the extended resources in this list. If empty or unspecified,
all pods will be sent to this extender.
- If IgnoredByScheduler is set to true for a resource, kube-scheduler
will skip checking the resource in predicates.
ignorable[Required] bool
Ignorable specifies if the extender is ignorable, i.e. scheduling should not
fail when the extender returns an error or is not reachable.
PredicateArgument represents the arguments to configure predicate functions in scheduler policy configuration.
Only one of its members may be specified
PredicatePolicy describes a struct of a predicate policy.
Field
Description
name[Required] string
Identifier of the predicate policy
For a custom predicate, the name can be user-defined
For the Kubernetes provided predicates, the name is the identifier of the pre-defined predicate
The priority function that ensures a good spread (anti-affinity) for pods belonging to a service
It uses a label to identify nodes that belong to the same "group"
PriorityPolicy describes a struct of a priority policy.
Field
Description
name[Required] string
Identifier of the priority policy
For a custom priority, the name can be user-defined
For the Kubernetes provided priority functions, the name is the identifier of the pre-defined priority function
weight[Required] int64
The numeric multiplier for the node scores that the priority function generates
The weight should be non-zero and can be a positive or a negative integer
ClientConnectionConfiguration contains details for constructing a client.
Field
Description
kubeconfig[Required] string
kubeconfig is the path to a KubeConfig file.
acceptContentTypes[Required] string
acceptContentTypes defines the Accept header sent by clients when connecting to a server, overriding the
default value of 'application/json'. This field will control all connections to the server used by a particular
client.
contentType[Required] string
contentType is the content type used when sending data to the server from this client.
qps[Required] float32
qps controls the number of queries per second allowed for this connection.
burst[Required] int32
burst allows extra queries to accumulate when a client is exceeding its rate.
LeaderElectionConfiguration defines the configuration of leader election
clients for components that can run with leader election enabled.
Field
Description
leaderElect[Required] bool
leaderElect enables a leader election client to gain leadership
before executing the main loop. Enable this when running replicated
components for high availability.
leaseDuration is the duration that non-leader candidates will wait
after observing a leadership renewal until attempting to acquire
leadership of a led but unrenewed leader slot. This is effectively the
maximum duration that a leader can be stopped before it is replaced
by another candidate. This is only applicable if leader election is
enabled.
renewDeadline is the interval between attempts by the acting master to
renew a leadership slot before it stops leading. This must be less
than or equal to the lease duration. This is only applicable if leader
election is enabled.
retryPeriod is the duration the clients should wait between attempting
acquisition and renewal of a leadership. This is only applicable if
leader election is enabled.
resourceLock[Required] string
resourceLock indicates the resource object type that will be used to lock
during leader election cycles.
resourceName[Required] string
resourceName indicates the name of resource object that will be used to lock
during leader election cycles.
resourceNamespace[Required] string
resourceNamespace indicates the namespace of the resource object that will be used to lock
during leader election cycles.
LoggingConfiguration contains logging options
Refer Logs Options for more information.
Field
Description
format[Required] string
Format Flag specifies the structure of log messages.
default value of format is `text`
sanitization[Required] bool
[Experimental] When enabled prevents logging of fields tagged as sensitive (passwords, keys, tokens).
Runtime log sanitization may introduce significant computation overhead and therefore should not be enabled in production.
DefaultPreemptionArgs
DefaultPreemptionArgs holds arguments used to configure the
DefaultPreemption plugin.
Field
Description
apiVersion string
kubescheduler.config.k8s.io/v1beta1
kind string
DefaultPreemptionArgs
minCandidateNodesPercentage[Required] int32
MinCandidateNodesPercentage is the minimum number of candidates to
shortlist when dry running preemption as a percentage of number of nodes.
Must be in the range [0, 100]. Defaults to 10% of the cluster size if
unspecified.
minCandidateNodesAbsolute[Required] int32
MinCandidateNodesAbsolute is the absolute minimum number of candidates to
shortlist. The likely number of candidates enumerated for dry running
preemption is given by the formula:
numCandidates = max(numNodes * minCandidateNodesPercentage, minCandidateNodesAbsolute)
We say "likely" because there are other factors such as PDB violations
that play a role in the number of candidates shortlisted. Must be at least
0 nodes. Defaults to 100 nodes if unspecified.
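Plugin argument types such as this one are supplied through a profile's pluginConfig. As a sketch, the following restates the documented defaults explicitly:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: DefaultPreemption
        args:
          minCandidateNodesPercentage: 10   # documented default: 10% of cluster size
          minCandidateNodesAbsolute: 100    # documented default: lower bound on candidates
```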
InterPodAffinityArgs
InterPodAffinityArgs holds arguments used to configure the InterPodAffinity plugin.
Field
Description
apiVersion string
kubescheduler.config.k8s.io/v1beta1
kind string
InterPodAffinityArgs
hardPodAffinityWeight[Required] int32
HardPodAffinityWeight is the scoring weight for existing pods with a
matching hard affinity to the incoming pod.
KubeSchedulerConfiguration
KubeSchedulerConfiguration configures a scheduler
Field
Description
apiVersion string
kubescheduler.config.k8s.io/v1beta1
kind string
KubeSchedulerConfiguration
parallelism[Required] int32
Parallelism defines the amount of parallelism in algorithms for scheduling Pods. Must be greater than 0. Defaults to 16.
(Members of DebuggingConfiguration are embedded into this type.)
DebuggingConfiguration holds configuration for Debugging related features
TODO: We might wanna make this a substruct like Debugging componentbaseconfigv1alpha1.DebuggingConfiguration
percentageOfNodesToScore[Required] int32
PercentageOfNodesToScore is the percentage of all nodes that once found feasible
for running a pod, the scheduler stops its search for more feasible nodes in
the cluster. This helps improve scheduler's performance. Scheduler always tries to find
at least "minFeasibleNodesToFind" feasible nodes no matter what the value of this flag is.
Example: if the cluster size is 500 nodes and the value of this flag is 30,
then scheduler stops finding further feasible nodes once it finds 150 feasible ones.
When the value is 0, default percentage (5%--50% based on the size of the cluster) of the
nodes will be scored.
podInitialBackoffSeconds[Required] int64
PodInitialBackoffSeconds is the initial backoff for unschedulable pods.
If specified, it must be greater than 0. If this value is null, the default value (1s)
will be used.
podMaxBackoffSeconds[Required] int64
PodMaxBackoffSeconds is the max backoff for unschedulable pods.
If specified, it must be greater than podInitialBackoffSeconds. If this value is null,
the default value (10s) will be used.
Profiles are scheduling profiles that kube-scheduler supports. Pods can
choose to be scheduled under a particular profile by setting its associated
scheduler name. Pods that don't specify any scheduler name are scheduled
with the "default-scheduler" profile, if present here.
Extenders are the list of scheduler extenders, each holding the values of how to communicate
with the extender. These extenders are shared by all scheduler profiles.
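As a sketch of the profile mechanism (the second scheduler name and its disabled plugin are hypothetical choices for illustration), a Pod opts into a profile by setting its .spec.schedulerName:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler   # used by Pods that specify no schedulerName
  - schedulerName: batch-scheduler     # hypothetical second profile
    plugins:
      score:
        disabled:
          - name: NodeResourcesLeastAllocated
```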
NodeAffinityArgs
NodeAffinityArgs holds arguments to configure the NodeAffinity plugin.
AddedAffinity is applied to all Pods additionally to the NodeAffinity
specified in the PodSpec. That is, Nodes need to satisfy AddedAffinity
AND .spec.NodeAffinity. AddedAffinity is empty by default (all Nodes
match).
When AddedAffinity is used, some Pods with affinity requirements that match
a specific Node (such as Daemonset Pods) might remain unschedulable.
NodeLabelArgs
NodeLabelArgs holds arguments used to configure the NodeLabel plugin.
Field
Description
apiVersion string
kubescheduler.config.k8s.io/v1beta1
kind string
NodeLabelArgs
presentLabels[Required] []string
PresentLabels should be present for the node to be considered a fit for hosting the pod
absentLabels[Required] []string
AbsentLabels should be absent for the node to be considered a fit for hosting the pod
presentLabelsPreference[Required] []string
Nodes that have labels in the list will get a higher score.
absentLabelsPreference[Required] []string
Nodes that don't have labels in the list will get a higher score.
NodeResourcesFitArgs
NodeResourcesFitArgs holds arguments used to configure the NodeResourcesFit plugin.
Field
Description
apiVersion string
kubescheduler.config.k8s.io/v1beta1
kind string
NodeResourcesFitArgs
ignoredResources[Required] []string
IgnoredResources is the list of resources that NodeResources fit filter
should ignore.
ignoredResourceGroups[Required] []string
IgnoredResourceGroups defines the list of resource groups that NodeResources fit filter should ignore.
e.g. if group is ["example.com"], it will ignore all resource names that begin
with "example.com", such as "example.com/aaa" and "example.com/bbb".
A resource group name can't contain '/'.
NodeResourcesLeastAllocatedArgs
NodeResourcesLeastAllocatedArgs holds arguments used to configure NodeResourcesLeastAllocated plugin.
Resources to be managed. If no resource is provided, a default resource set, with the
weights of "cpu" and "memory" both set to "1", will be applied.
A resource with weight "0" will not be accounted for in the final score.
NodeResourcesMostAllocatedArgs
NodeResourcesMostAllocatedArgs holds arguments used to configure NodeResourcesMostAllocated plugin.
Resources to be managed. If no resource is provided, a default resource set, with the
weights of "cpu" and "memory" both set to "1", will be applied.
A resource with weight "0" will not be accounted for in the final score.
PodTopologySpreadArgs
PodTopologySpreadArgs holds arguments used to configure the PodTopologySpread plugin.
DefaultConstraints defines topology spread constraints to be applied to
Pods that don't define any in `pod.spec.topologySpreadConstraints`.
`.defaultConstraints[*].labelSelectors` must be empty, as they are
deduced from the Pod's membership to Services, ReplicationControllers,
ReplicaSets or StatefulSets.
When not empty, .defaultingType must be "List".
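A hedged sketch of such default constraints (the topology key and skew here are illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultingType: List
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
              # labelSelector must be omitted here; it is deduced from the Pod's
              # membership in Services, ReplicationControllers, ReplicaSets, or StatefulSets
```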
ServiceAffinityArgs holds arguments used to configure the ServiceAffinity plugin.
Field
Description
apiVersion string
kubescheduler.config.k8s.io/v1beta1
kind string
ServiceAffinityArgs
affinityLabels[Required] []string
AffinityLabels are homogeneous for pods that are scheduled to a node.
(i.e. it returns true IFF this pod can be added to this node such that all other pods in
the same service are running on nodes with the exact same values for Labels).
antiAffinityLabelsPreference[Required] []string
AntiAffinityLabelsPreference are the labels to consider for service anti affinity scoring.
VolumeBindingArgs
VolumeBindingArgs holds arguments used to configure the VolumeBinding plugin.
Field
Description
apiVersion string
kubescheduler.config.k8s.io/v1beta1
kind string
VolumeBindingArgs
bindTimeoutSeconds[Required] int64
BindTimeoutSeconds is the timeout in seconds in volume binding operation.
Value must be non-negative integer. The value zero indicates no waiting.
If this value is nil, the default value (600) will be used.
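For completeness, a sketch passing the documented default explicitly:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: VolumeBinding
        args:
          bindTimeoutSeconds: 600   # documented default; zero means no waiting
```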
Extender holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
it is assumed that the extender chose not to provide that extension.
Field
Description
urlPrefix[Required] string
URLPrefix at which the extender is available
filterVerb[Required] string
Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
preemptVerb[Required] string
Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to extender.
prioritizeVerb[Required] string
Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
weight[Required] int64
The numeric multiplier for the node scores that the prioritize call generates.
The weight should be a positive integer
bindVerb[Required] string
Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to extender.
If this method is implemented by the extender, it is the extender's responsibility to bind the pod to apiserver. Only one extender
can implement this function.
enableHTTPS[Required] bool
EnableHTTPS specifies whether https should be used to communicate with the extender
HTTPTimeout specifies the timeout duration for a call to the extender. A filter timeout fails the
scheduling of the pod; a prioritize timeout is ignored, and k8s/other extenders' priorities are used to select the node.
nodeCacheCapable[Required] bool
NodeCacheCapable specifies that the extender is capable of caching node information,
so the scheduler should only send minimal information about the eligible nodes
assuming that the extender already cached full details of all nodes in the cluster
ManagedResources is a list of extended resources that are managed by
this extender.
- A pod will be sent to the extender on the Filter, Prioritize and Bind
(if the extender is the binder) phases iff the pod requests at least
one of the extended resources in this list. If empty or unspecified,
all pods will be sent to this extender.
- If IgnoredByScheduler is set to true for a resource, kube-scheduler
will skip checking the resource in predicates.
ignorable[Required] bool
Ignorable specifies if the extender is ignorable, i.e. scheduling should not
fail when the extender returns an error or is not reachable.
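Putting the fields above together, a minimal extender entry might look like the following sketch (the endpoint and extended-resource name are hypothetical):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: https://extender.example.com/scheduler  # hypothetical endpoint
  filterVerb: filter           # appended to urlPrefix for the filter call
  prioritizeVerb: prioritize
  weight: 1
  enableHTTPS: true
  nodeCacheCapable: false
  ignorable: true              # scheduling does not fail if the extender is down
  managedResources:
  - name: example.com/gpu      # hypothetical extended resource
    ignoredByScheduler: true
```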
SchedulerName is the name of the scheduler associated to this profile.
If SchedulerName matches with the pod's "spec.schedulerName", then the pod
is scheduled with this profile.
Plugins specify the set of plugins that should be enabled or disabled.
Enabled plugins are the ones that should be enabled in addition to the
default plugins. Disabled plugins are any of the default plugins that
should be disabled.
When no enabled or disabled plugin is specified for an extension point,
default plugins for that extension point will be used if there is any.
If a QueueSort plugin is specified, the same QueueSort Plugin and
PluginConfig must be specified for all profiles.
PluginConfig is an optional set of custom plugin arguments for each plugin.
Omitting config args for a plugin is equivalent to using the default config
for that plugin.
PluginConfig specifies arguments that should be passed to a plugin at the time of initialization.
A plugin that is invoked at multiple extension points is initialized once. Args can have arbitrary structure.
It is up to the plugin to process these Args.
PluginSet specifies enabled and disabled plugins for an extension point.
If an array is empty, missing, or nil, default plugins at that extension point will be used.
Enabled specifies plugins that should be enabled in addition to default plugins.
These are called after default plugins and in the same order specified here.
Disabled specifies default plugins that should be disabled.
When all default plugins need to be disabled, an array containing only one "*" should be provided.
Plugins include multiple extension points. When specified, the list of plugins for
a particular extension point are the only ones enabled. If an extension point is
omitted from the config, then the default set of plugins is used for that extension point.
Enabled plugins are called in the order specified here, after default plugins. If they need to
be invoked before default plugins, default plugins must be disabled and re-enabled here in desired order.
Bind is a list of plugins that should be invoked at the "Bind" extension point of the scheduling framework.
The scheduler calls these plugins in order and skips the rest as soon as one returns success.
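To illustrate the ordering rules above: disabling all default plugins of an extension point with "*" and re-enabling a single one makes it the only plugin at that point for the profile. A sketch (the scheduler name is hypothetical):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler  # hypothetical profile name
  plugins:
    score:
      disabled:
      - name: "*"              # disable every default score plugin
      enabled:
      - name: NodeResourcesMostAllocated
```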
Holds the information to communicate with the extender(s)
hardPodAffinitySymmetricWeight[Required] int32
RequiredDuringScheduling affinity is not symmetric, but there is an implicit PreferredDuringScheduling affinity rule
corresponding to every RequiredDuringScheduling affinity rule.
HardPodAffinitySymmetricWeight represents the weight of implicit PreferredDuringScheduling affinity rule, in the range 1-100.
alwaysCheckAllPredicates[Required] bool
When AlwaysCheckAllPredicates is set to true, scheduler checks all
the configured predicates even after one or more of them fails.
When the flag is set to false, scheduler skips checking the rest
of the predicates after it finds one predicate that failed.
ExtenderTLSConfig
ExtenderTLSConfig contains settings to enable TLS with extender
Field
Description
insecure[Required] bool
Server should be accessed without verifying the TLS certificate. For testing only.
serverName[Required] string
ServerName is passed to the server for SNI and is used in the client to check server
certificates against. If ServerName is empty, the hostname used to contact the
server is used.
certFile[Required] string
Server requires TLS client certificate authentication
keyFile[Required] string
Server requires TLS client certificate authentication
caFile[Required] string
Trusted root certificates for server
certData[Required] []byte
CertData holds PEM-encoded bytes (typically read from a client certificate file).
CertData takes precedence over CertFile
keyData[Required] []byte
KeyData holds PEM-encoded bytes (typically read from a client certificate key file).
KeyData takes precedence over KeyFile
caData[Required] []byte
CAData holds PEM-encoded bytes (typically read from a root certificates bundle).
CAData takes precedence over CAFile
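Assuming the extender entry exposes these settings under a tlsConfig field, client-certificate authentication toward the extender could be sketched as follows (endpoint and file paths are illustrative):

```yaml
extenders:
- urlPrefix: https://extender.example.com  # hypothetical endpoint
  enableHTTPS: true
  tlsConfig:
    caFile: /etc/kubernetes/extender/ca.crt       # trusted roots for the server
    certFile: /etc/kubernetes/extender/client.crt # client certificate
    keyFile: /etc/kubernetes/extender/client.key  # matching private key
```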
LabelPreference holds the parameters that are used to configure the corresponding priority function
Field
Description
label[Required] string
Used to identify node "groups"
presence[Required] bool
This is a boolean flag
If true, higher priority is given to nodes that have the label
If false, higher priority is given to nodes that do not have the label
LabelsPresence holds the parameters that are used to configure the corresponding predicate in scheduler policy configuration.
Field
Description
labels[Required] []string
The list of labels that identify node "groups"
All of the labels should be either present (or absent) for the node to be considered a fit for hosting the pod
presence[Required] bool
The boolean flag that indicates whether the labels should be present or absent from the node
LegacyExtender
LegacyExtender holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
it is assumed that the extender chose not to provide that extension.
Field
Description
urlPrefix[Required] string
URLPrefix at which the extender is available
filterVerb[Required] string
Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
preemptVerb[Required] string
Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to extender.
prioritizeVerb[Required] string
Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
weight[Required] int64
The numeric multiplier for the node scores that the prioritize call generates.
The weight should be a positive integer
bindVerb[Required] string
Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to extender.
If this method is implemented by the extender, it is the extender's responsibility to bind the pod to apiserver. Only one extender
can implement this function.
enableHttps[Required] bool
EnableHTTPS specifies whether https should be used to communicate with the extender
HTTPTimeout specifies the timeout duration for a call to the extender. A filter timeout fails the
scheduling of the pod; a prioritize timeout is ignored, and k8s/other extenders' priorities are used to select the node.
nodeCacheCapable[Required] bool
NodeCacheCapable specifies that the extender is capable of caching node information,
so the scheduler should only send minimal information about the eligible nodes
assuming that the extender already cached full details of all nodes in the cluster
ManagedResources is a list of extended resources that are managed by
this extender.
- A pod will be sent to the extender on the Filter, Prioritize and Bind
(if the extender is the binder) phases iff the pod requests at least
one of the extended resources in this list. If empty or unspecified,
all pods will be sent to this extender.
- If IgnoredByScheduler is set to true for a resource, kube-scheduler
will skip checking the resource in predicates.
ignorable[Required] bool
Ignorable specifies if the extender is ignorable, i.e. scheduling should not
fail when the extender returns an error or is not reachable.
PredicateArgument represents the arguments to configure predicate functions in scheduler policy configuration.
Only one of its members may be specified
PredicatePolicy describes a struct of a predicate policy.
Field
Description
name[Required] string
Identifier of the predicate policy
For a custom predicate, the name can be user-defined
For the Kubernetes provided predicates, the name is the identifier of the pre-defined predicate
The priority function that ensures a good spread (anti-affinity) for pods belonging to a service
It uses a label to identify nodes that belong to the same "group"
PriorityPolicy describes a struct of a priority policy.
Field
Description
name[Required] string
Identifier of the priority policy
For a custom priority, the name can be user-defined
For the Kubernetes provided priority functions, the name is the identifier of the pre-defined priority function
weight[Required] int64
The numeric multiplier for the node scores that the priority function generates
The weight should be non-zero and can be a positive or a negative integer
KubeletConfiguration contains the configuration for the Kubelet
Field
Description
apiVersion string
kubelet.config.k8s.io/v1beta1
kind string
KubeletConfiguration
enableServer[Required] bool
enableServer enables Kubelet's secured server.
Note: Kubelet's insecure port is controlled by the readOnlyPort option.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: true
staticPodPath string
staticPodPath is the path to the directory containing local (static) pods to
run, or the path to a single static pod file.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
the set of static pods specified at the new path may be different than the
ones the Kubelet initially started with, and this may disrupt your node.
Default: ""
syncFrequency is the max period between synchronizing running
containers and config.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
shortening this duration may have a negative performance impact, especially
as the number of Pods on the node increases. Alternatively, increasing this
duration will result in longer refresh times for ConfigMaps and Secrets.
Default: "1m"
fileCheckFrequency is the duration between checking config files for
new data
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
shortening the duration will cause the Kubelet to reload local Static Pod
configurations more frequently, which may have a negative performance impact.
Default: "20s"
httpCheckFrequency is the duration between checking http for new data
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
shortening the duration will cause the Kubelet to poll staticPodURL more
frequently, which may have a negative performance impact.
Default: "20s"
staticPodURL string
staticPodURL is the URL for accessing static pods to run
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
the set of static pods specified at the new URL may be different than the
ones the Kubelet initially started with, and this may disrupt your node.
Default: ""
staticPodURLHeader map[string][]string
staticPodURLHeader is a map of slices with HTTP headers to use when accessing the podURL
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt the ability to read the latest set of static pods from StaticPodURL.
Default: nil
address string
address is the IP address for the Kubelet to serve on (set to 0.0.0.0
for all interfaces).
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: "0.0.0.0"
port int32
port is the port for the Kubelet to serve on.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: 10250
readOnlyPort int32
readOnlyPort is the read-only port for the Kubelet to serve on with
no authentication/authorization.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: 0 (disabled)
tlsCertFile string
tlsCertFile is the file containing x509 Certificate for HTTPS. (CA cert,
if any, concatenated after server cert). If tlsCertFile and
tlsPrivateKeyFile are not provided, a self-signed certificate
and key are generated for the public address and saved to the directory
passed to the Kubelet's --cert-dir flag.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: ""
tlsPrivateKeyFile string
tlsPrivateKeyFile is the file containing x509 private key matching tlsCertFile
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: ""
tlsCipherSuites []string
TLSCipherSuites is the list of allowed cipher suites for the server.
Values are from tls package constants (https://golang.org/pkg/crypto/tls/#pkg-constants).
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: nil
tlsMinVersion string
TLSMinVersion is the minimum TLS version supported.
Values are from tls package constants (https://golang.org/pkg/crypto/tls/#pkg-constants).
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: ""
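Taken together, the TLS-related fields above might be set as in this sketch (the paths are illustrative; the cipher-suite and version names come from the Go tls package constants):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
tlsCertFile: /var/lib/kubelet/pki/kubelet.crt     # illustrative path
tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet.key
tlsMinVersion: VersionTLS12
tlsCipherSuites:
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
```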
rotateCertificates bool
rotateCertificates enables client certificate rotation. The Kubelet will request a
new certificate from the certificates.k8s.io API. This requires an approver to approve the
certificate signing requests.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
disabling it may disrupt the Kubelet's ability to authenticate with the API server
after the current certificate expires.
Default: false
serverTLSBootstrap bool
serverTLSBootstrap enables server certificate bootstrap. Instead of self
signing a serving certificate, the Kubelet will request a certificate from
the certificates.k8s.io API. This requires an approver to approve the
certificate signing requests. The RotateKubeletServerCertificate feature
must be enabled.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
disabling it will stop the renewal of Kubelet server certificates, which can
disrupt components that interact with the Kubelet server in the long term,
due to certificate expiration.
Default: false
authentication specifies how requests to the Kubelet's server are authenticated
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Defaults:
anonymous:
enabled: false
webhook:
enabled: true
cacheTTL: "2m"
authorization specifies how requests to the Kubelet's server are authorized
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Defaults:
mode: Webhook
webhook:
cacheAuthorizedTTL: "5m"
cacheUnauthorizedTTL: "30s"
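The defaults listed for both fields correspond to the following KubeletConfiguration fragment:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
    cacheTTL: "2m"
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: "5m"
    cacheUnauthorizedTTL: "30s"
```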
registryPullQPS int32
registryPullQPS is the limit of registry pulls per second.
Set to 0 for no limit.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact scalability by changing the amount of traffic produced
by image pulls.
Default: 5
registryBurst int32
registryBurst is the maximum size of bursty pulls, temporarily allows
pulls to burst to this number, while still not exceeding registryPullQPS.
Only used if registryPullQPS > 0.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact scalability by changing the amount of traffic produced
by image pulls.
Default: 10
eventRecordQPS int32
eventRecordQPS is the maximum event creations per second. If 0, there
is no limit enforced.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact scalability by changing the amount of traffic produced by
event creations.
Default: 5
eventBurst int32
eventBurst is the maximum size of a burst of event creations, temporarily
allows event creations to burst to this number, while still not exceeding
eventRecordQPS. Only used if eventRecordQPS > 0.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact scalability by changing the amount of traffic produced by
event creations.
Default: 10
enableDebuggingHandlers bool
enableDebuggingHandlers enables server endpoints for log access
and local running of containers and commands, including the exec,
attach, logs, and portforward features.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
disabling it may disrupt components that interact with the Kubelet server.
Default: true
enableContentionProfiling bool
enableContentionProfiling enables lock contention profiling, if enableDebuggingHandlers is true.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
enabling it may carry a performance impact.
Default: false
healthzPort int32
healthzPort is the port of the localhost healthz endpoint (set to 0 to disable)
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that monitor Kubelet health.
Default: 10248
healthzBindAddress string
healthzBindAddress is the IP address for the healthz server to serve on
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that monitor Kubelet health.
Default: "127.0.0.1"
oomScoreAdj int32
oomScoreAdj is the oom-score-adj value for the kubelet process. Values
must be within the range [-1000, 1000].
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact the stability of nodes under memory pressure.
Default: -999
clusterDomain string
clusterDomain is the DNS domain for this cluster. If set, kubelet will
configure all containers to search this domain in addition to the
host's search domains.
Dynamic Kubelet Config (beta): Dynamically updating this field is not recommended,
as it should be kept in sync with the rest of the cluster.
Default: ""
clusterDNS []string
clusterDNS is a list of IP addresses for the cluster DNS server. If set,
kubelet will configure all containers to use this for DNS resolution
instead of the host's DNS servers.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
changes will only take effect on Pods created after the update. Draining
the node is recommended before changing this field.
Default: nil
streamingConnectionIdleTimeout is the maximum time a streaming connection
can be idle before the connection is automatically closed.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact components that rely on infrequent updates over streaming
connections to the Kubelet server.
Default: "4h"
nodeStatusUpdateFrequency is the frequency that kubelet computes node
status. If node lease feature is not enabled, it is also the frequency that
kubelet posts node status to master.
Note: When node lease feature is not enabled, be cautious when changing the
constant, it must work with nodeMonitorGracePeriod in nodecontroller.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact node scalability, and also that the node controller's
nodeMonitorGracePeriod must be set to N*NodeStatusUpdateFrequency,
where N is the number of retries before the node controller marks
the node unhealthy.
Default: "10s"
nodeStatusReportFrequency is the frequency that kubelet posts node
status to master if node status does not change. Kubelet will ignore this
frequency and post node status immediately if any change is detected. It is
only used when node lease feature is enabled. nodeStatusReportFrequency's
default value is 1m. But if nodeStatusUpdateFrequency is set explicitly,
nodeStatusReportFrequency's default value will be set to
nodeStatusUpdateFrequency for backward compatibility.
Default: "1m"
nodeLeaseDurationSeconds int32
nodeLeaseDurationSeconds is the duration the Kubelet will set on its corresponding Lease,
when the NodeLease feature is enabled. This feature provides an indicator of node
health by having the Kubelet create and periodically renew a lease, named after the node,
in the kube-node-lease namespace. If the lease expires, the node can be considered unhealthy.
The lease is currently renewed every 10s, per KEP-0009. In the future, the lease renewal interval
may be set based on the lease duration.
Requires the NodeLease feature gate to be enabled.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
decreasing the duration may reduce tolerance for issues that temporarily prevent
the Kubelet from renewing the lease (e.g. a short-lived network issue).
Default: 40
imageMinimumGCAge is the minimum age for an unused image before it is
garbage collected.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may trigger or delay garbage collection, and may change the image overhead
on the node.
Default: "2m"
imageGCHighThresholdPercent int32
imageGCHighThresholdPercent is the percent of disk usage after which
image garbage collection is always run. The percent is calculated as
this field value out of 100.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may trigger or delay garbage collection, and may change the image overhead
on the node.
Default: 85
imageGCLowThresholdPercent int32
imageGCLowThresholdPercent is the percent of disk usage before which
image garbage collection is never run. Lowest disk usage to garbage
collect to. The percent is calculated as this field value out of 100.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may trigger or delay garbage collection, and may change the image overhead
on the node.
Default: 80
How frequently to calculate and cache volume disk usage for all pods
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
shortening the period may carry a performance impact.
Default: "1m"
kubeletCgroups string
kubeletCgroups is the absolute name of cgroups to isolate the kubelet in
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: ""
systemCgroups string
systemCgroups is absolute name of cgroups in which to place
all non-kernel processes that are not already in a container. Empty
for no container. Rolling back the flag requires a reboot.
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: ""
cgroupRoot string
cgroupRoot is the root cgroup to use for pods. This is handled by the
container runtime on a best effort basis.
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: ""
cgroupsPerQOS bool
Enable a QoS-based cgroup hierarchy: top-level cgroups for QoS classes,
with all Burstable and BestEffort pods brought up under their
specific top-level QoS cgroup.
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: true
cgroupDriver string
driver that the kubelet uses to manipulate cgroups on the host (cgroupfs or systemd)
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: "cgroupfs"
cpuManagerPolicy string
CPUManagerPolicy is the name of the policy to use.
Requires the CPUManager feature gate to be enabled.
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: "none"
CPU Manager reconciliation period.
Requires the CPUManager feature gate to be enabled.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
shortening the period may carry a performance impact.
Default: "10s"
topologyManagerPolicy string
TopologyManagerPolicy is the name of the policy to use.
Policies other than "none" require the TopologyManager feature gate to be enabled.
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: "none"
topologyManagerScope string
TopologyManagerScope represents the scope of topology hint generation
that topology manager requests and hint providers generate.
"pod" scope requires the TopologyManager feature gate to be enabled.
Default: "container"
qosReserved map[string]string
qosReserved is a set of resource name to percentage pairs that specify
the minimum percentage of a resource reserved for exclusive use by the
guaranteed QoS tier.
Currently supported resources: "memory"
Requires the QOSReserved feature gate to be enabled.
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: nil
runtimeRequestTimeout is the timeout for all runtime requests except long running
requests - pull, logs, exec and attach.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may disrupt components that interact with the Kubelet server.
Default: "2m"
hairpinMode string
hairpinMode specifies how the Kubelet should configure the container
bridge for hairpin packets.
Setting this flag allows endpoints in a Service to loadbalance back to
themselves if they should try to access their own Service. Values:
"promiscuous-bridge": make the container bridge promiscuous.
"hairpin-veth": set the hairpin flag on container veth interfaces.
"none": do nothing.
Generally, one must set --hairpin-mode=hairpin-veth to achieve hairpin NAT,
because promiscuous-bridge assumes the existence of a container bridge named cbr0.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may require a node reboot, depending on the network plugin.
Default: "promiscuous-bridge"
maxPods int32
maxPods is the number of pods that can run on this Kubelet.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
changes may cause Pods to fail admission on Kubelet restart, and may change
the value reported in Node.Status.Capacity[v1.ResourcePods], thus affecting
future scheduling decisions. Increasing this value may also decrease performance,
as more Pods can be packed into a single node.
Default: 110
podCIDR string
The CIDR to use for pod IP addresses, only used in standalone mode.
In cluster mode, this is obtained from the master.
Dynamic Kubelet Config (beta): This field should always be set to the empty default.
It should only be set for standalone Kubelets, which cannot use Dynamic Kubelet Config.
Default: ""
podPidsLimit int64
PodPidsLimit is the maximum number of pids in any pod.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
lowering it may prevent container processes from forking after the change.
Default: -1
resolvConf string
ResolverConfig is the resolver configuration file used as the basis
for the container DNS resolution configuration.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
changes will only take effect on Pods created after the update. Draining
the node is recommended before changing this field.
Default: "/etc/resolv.conf"
runOnce bool
RunOnce causes the Kubelet to check the API server once for pods,
run those in addition to the pods specified by static pod files, and exit.
Default: false
cpuCFSQuota bool
cpuCFSQuota enables CPU CFS quota enforcement for containers that
specify CPU limits.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
disabling it may reduce node stability.
Default: true
CPUCFSQuotaPeriod is the CPU CFS quota period value, cpu.cfs_period_us.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
limits set for containers will result in different cpu.cfs_quota settings. This
will trigger container restarts on the node being reconfigured.
Default: "100ms"
nodeStatusMaxImages int32
nodeStatusMaxImages caps the number of images reported in Node.Status.Images.
Note: If -1 is specified, no cap will be applied. If 0 is specified, no image is returned.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
different values can be reported on node status.
Default: 50
maxOpenFiles int64
maxOpenFiles is the number of files that can be opened by the Kubelet process.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact the ability of the Kubelet to interact with the node's filesystem.
Default: 1000000
contentType string
contentType is contentType of requests sent to apiserver.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact the ability for the Kubelet to communicate with the API server.
If the Kubelet loses contact with the API server due to a change to this field,
the change cannot be reverted via dynamic Kubelet config.
Default: "application/vnd.kubernetes.protobuf"
kubeAPIQPS int32
kubeAPIQPS is the QPS to use while talking with kubernetes apiserver
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact scalability by changing the amount of traffic the Kubelet
sends to the API server.
Default: 5
kubeAPIBurst int32
kubeAPIBurst is the burst to allow while talking with kubernetes apiserver
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact scalability by changing the amount of traffic the Kubelet
sends to the API server.
Default: 10
serializeImagePulls bool
serializeImagePulls, when enabled, tells the Kubelet to pull images one
at a time. We recommend *not* changing the default value on nodes that
run a docker daemon with version < 1.9 or an Aufs storage backend.
Issue #10959 has more details.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact the performance of image pulls.
Default: true
evictionHard map[string]string
Map of signal names to quantities that defines hard eviction thresholds. For example: {"memory.available": "300Mi"}.
To explicitly disable, pass a 0% or 100% threshold on an arbitrary resource.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may trigger or delay Pod evictions.
Default:
memory.available: "100Mi"
nodefs.available: "10%"
nodefs.inodesFree: "5%"
imagefs.available: "15%"
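As a sketch, the hard eviction thresholds above can be overridden in a KubeletConfiguration file; the quantities here are illustrative, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  # Evict pods once less than 300Mi of memory is available on the node.
  memory.available: "300Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
```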
evictionSoft map[string]string
Map of signal names to quantities that defines soft eviction thresholds.
For example: {"memory.available": "300Mi"}.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may trigger or delay Pod evictions, and may change the allocatable reported
by the node.
Default: nil
evictionSoftGracePeriod map[string]string
Map of signal names to quantities that defines grace periods for each soft eviction signal.
For example: {"memory.available": "30s"}.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may trigger or delay Pod evictions.
Default: nil
evictionPressureTransitionPeriod meta/v1.Duration
Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
lowering it may decrease the stability of the node when the node is overcommitted.
Default: "5m"
evictionMaxPodGracePeriod int32
Maximum allowed grace period (in seconds) to use when terminating pods in
response to a soft eviction threshold being met. This value effectively caps
the Pod's TerminationGracePeriodSeconds value during soft evictions.
Note: Due to issue #64530, this value currently overrides the grace period set
on the Pod during soft eviction, which can increase the grace period from what
is set on the Pod. This bug will be fixed in a future release.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
lowering it decreases the amount of time Pods will have to gracefully clean
up before being killed during a soft eviction.
Default: 0
evictionMinimumReclaim map[string]string
Map of signal names to quantities that defines minimum reclaims, which describe the minimum
amount of a given resource the kubelet will reclaim when performing a pod eviction while
that resource is under pressure. For example: {"imagefs.available": "2Gi"}
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may change how well eviction can manage resource pressure.
Default: nil
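The soft eviction fields above work together: each evictionSoft signal needs a matching evictionSoftGracePeriod entry. A sketch with illustrative values:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  # How long memory.available must stay below the soft threshold
  # before an eviction is triggered.
  memory.available: "90s"
# Cap (in seconds) on the grace period used when soft-evicting pods.
evictionMaxPodGracePeriod: 60
evictionMinimumReclaim:
  imagefs.available: "2Gi"
```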
podsPerCore int32
podsPerCore is the maximum number of pods per core. Cannot exceed MaxPods.
If 0, this field is ignored.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
changes may cause Pods to fail admission on Kubelet restart, and may change
the value reported in Node.Status.Capacity[v1.ResourcePods], thus affecting
future scheduling decisions. Increasing this value may also decrease performance,
as more Pods can be packed into a single node.
Default: 0
enableControllerAttachDetach bool
enableControllerAttachDetach enables the Attach/Detach controller to
manage attachment/detachment of volumes scheduled to this node, and
disables kubelet from executing any attach/detach operations
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
changing which component is responsible for volume management on a live node
may result in volumes refusing to detach if the node is not drained prior to
the update, and if Pods are scheduled to the node before the
volumes.kubernetes.io/controller-managed-attach-detach annotation is updated by the
Kubelet. In general, it is safest to leave this value set the same as local config.
Default: true
protectKernelDefaults bool
protectKernelDefaults, if true, causes the Kubelet to error if kernel
flags are not as it expects. Otherwise the Kubelet will attempt to modify
kernel flags to match its expectation.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
enabling it may cause the Kubelet to crash-loop if the Kernel is not configured as
Kubelet expects.
Default: false
makeIPTablesUtilChains bool
If true, Kubelet ensures a set of iptables rules are present on the host.
These rules will serve as utility rules for various components, e.g. KubeProxy.
The rules will be created based on IPTablesMasqueradeBit and IPTablesDropBit.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
disabling it will prevent the Kubelet from healing locally misconfigured iptables rules.
Default: true
iptablesMasqueradeBit int32
iptablesMasqueradeBit is the bit of the iptables fwmark space to mark for SNAT
Values must be within the range [0, 31]. Must be different from other mark bits.
Warning: Please match the value of the corresponding parameter in kube-proxy.
TODO: clean up IPTablesMasqueradeBit in kube-proxy
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it needs to be coordinated with other components, like kube-proxy, and the update
will only be effective if MakeIPTablesUtilChains is enabled.
Default: 14
iptablesDropBit int32
iptablesDropBit is the bit of the iptables fwmark space to mark for dropping packets.
Values must be within the range [0, 31]. Must be different from other mark bits.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it needs to be coordinated with other components, like kube-proxy, and the update
will only be effective if MakeIPTablesUtilChains is enabled.
Default: 15
featureGates map[string]bool
featureGates is a map of feature names to bools that enable or disable alpha/experimental
features. This field modifies piecemeal the built-in default values from
"k8s.io/kubernetes/pkg/features/kube_features.go".
Dynamic Kubelet Config (beta): If dynamically updating this field, consider the
documentation for the features you are enabling or disabling. While we
encourage feature developers to make it possible to dynamically enable
and disable features, some changes may require node reboots, and some
features may require careful coordination to retroactively disable.
Default: nil
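For example, individual gates can be toggled in the configuration file. GracefulNodeShutdown is used here only as an illustration; the set of available gate names depends on your Kubernetes version:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Overrides the built-in default for this gate on this node only.
  GracefulNodeShutdown: true
```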
failSwapOn bool
failSwapOn tells the Kubelet to fail to start if swap is enabled on the node.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
setting it to true will cause the Kubelet to crash-loop if swap is enabled.
Default: true
containerLogMaxSize string
A quantity defines the maximum size of the container log file before it is rotated.
For example: "5Mi" or "256Ki".
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may trigger log rotation.
Default: "10Mi"
containerLogMaxFiles int32
Maximum number of container log files that can be present for a container.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
lowering it may cause log files to be deleted.
Default: 5
configMapAndSecretChangeDetectionStrategy ResourceChangeDetectionStrategy
ConfigMapAndSecretChangeDetectionStrategy is the mode in which the
ConfigMap and Secret managers are running.
Default: "Watch"
systemReserved map[string]string
systemReserved is a set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G)
pairs that describe resources reserved for non-kubernetes components.
Currently only cpu and memory are supported.
See http://kubernetes.io/docs/user-guide/compute-resources for more detail.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may not be possible to increase the reserved resources, because this
requires resizing cgroups. Always look for a NodeAllocatableEnforced event
after updating this field to ensure that the update was successful.
Default: nil
kubeReserved map[string]string
A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs
that describe resources reserved for kubernetes system components.
Currently cpu, memory and local storage for root file system are supported.
See http://kubernetes.io/docs/user-guide/compute-resources for more detail.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may not be possible to increase the reserved resources, because this
requires resizing cgroups. Always look for a NodeAllocatableEnforced event
after updating this field to ensure that the update was successful.
Default: nil
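A sketch reserving resources for OS daemons and Kubernetes components; the quantities are illustrative and should be sized from observed daemon usage on your nodes:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: 500m
  memory: 1Gi
kubeReserved:
  cpu: 500m
  memory: 1Gi
```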
reservedSystemCPUs[Required] string
This ReservedSystemCPUs option specifies the CPU list reserved for host-level system threads and Kubernetes-related threads.
This provides a "static" CPU list rather than the "dynamic" list computed from system-reserved and kube-reserved.
This option overwrites the CPUs provided by system-reserved and kube-reserved.
showHiddenMetricsForVersion string
The previous version for which you want to show hidden metrics.
Only the previous minor version is meaningful, other values will not be allowed.
The format is <major>.<minor>, e.g. '1.16'.
The purpose of this format is to make sure you have the opportunity to notice if the next release hides additional metrics,
rather than being surprised when they are permanently removed in the release after that.
Default: ""
systemReservedCgroup string
This flag helps the kubelet identify the absolute name of the top-level cgroup used to enforce the `SystemReserved` compute resource reservation for OS system daemons.
Refer to [Node Allocatable](https://git.k8s.io/community/contributors/design-proposals/node/node-allocatable.md) doc for more information.
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: ""
kubeReservedCgroup string
This flag helps the kubelet identify the absolute name of the top-level cgroup used to enforce the `KubeReserved` compute resource reservation for Kubernetes node system daemons.
Refer to [Node Allocatable](https://git.k8s.io/community/contributors/design-proposals/node/node-allocatable.md) doc for more information.
Dynamic Kubelet Config (beta): This field should not be updated without a full node
reboot. It is safest to keep this value the same as the local config.
Default: ""
enforceNodeAllocatable []string
This flag specifies the various Node Allocatable enforcements that Kubelet needs to perform.
This flag accepts a list of options. Acceptable options are `none`, `pods`, `system-reserved` & `kube-reserved`.
If `none` is specified, no other options may be specified.
Refer to [Node Allocatable](https://git.k8s.io/community/contributors/design-proposals/node/node-allocatable.md) doc for more information.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
removing enforcements may reduce the stability of the node. Alternatively, adding
enforcements may reduce the stability of components which were using more than
the reserved amount of resources; for example, enforcing kube-reserved may cause
Kubelets to OOM if it uses more than the reserved resources, and enforcing system-reserved
may cause system daemons to OOM if they use more than the reserved resources.
Default: ["pods"]
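A sketch combining the enforcement list with the reservations it acts on. Note that enforcing system-reserved generally requires the corresponding systemReservedCgroup to be set; the cgroup path below is illustrative:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enforceNodeAllocatable:
- pods
- system-reserved
systemReserved:
  cpu: 500m
  memory: 1Gi
# Required when enforcing system-reserved; the cgroup must already exist.
systemReservedCgroup: /system.slice
```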
allowedUnsafeSysctls []string
A comma separated whitelist of unsafe sysctls or sysctl patterns (ending in *).
Unsafe sysctl groups are kernel.shm*, kernel.msg*, kernel.sem, fs.mqueue.*, and net.*.
These sysctls are namespaced but not allowed by default. For example: "kernel.msg*,net.ipv4.route.min_pmtu"
Default: []
volumePluginDir string
volumePluginDir is the full path of the directory in which to search
for additional third party volume plugins.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that changing
the volumePluginDir may disrupt workloads relying on third party volume plugins.
Default: "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/"
providerID string
providerID, if set, sets the unique id of the instance that an external provider (i.e. cloudprovider)
can use to identify a specific node.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact the ability of the Kubelet to interact with cloud providers.
Default: ""
kernelMemcgNotification bool
kernelMemcgNotification, if enabled, causes the kubelet to integrate with the kernel memcg
notification mechanism to determine if memory eviction thresholds are crossed, rather than polling.
Dynamic Kubelet Config (beta): If dynamically updating this field, consider that
it may impact the way Kubelet interacts with the kernel.
Default: false
logging LoggingConfiguration
Logging specifies the options of logging.
Refer to [Logs Options](https://github.com/kubernetes/component-base/blob/master/logs/options.go) for more information.
Defaults:
Format: text
enableSystemLogHandler bool
enableSystemLogHandler enables system logs via web interface host:port/logs/
Default: true
shutdownGracePeriod meta/v1.Duration
ShutdownGracePeriod specifies the total duration by which the node should delay the shutdown, which is also the total grace period for pod termination during a node shutdown.
Default: "30s"
shutdownGracePeriodCriticalPods meta/v1.Duration
ShutdownGracePeriodCriticalPods specifies the duration used to terminate critical pods during a node shutdown. This should be less than ShutdownGracePeriod.
For example, if ShutdownGracePeriod=30s, and ShutdownGracePeriodCriticalPods=10s, during a node shutdown the first 20 seconds would be reserved for gracefully terminating normal pods, and the last 10 seconds would be reserved for terminating critical pods.
Default: "10s"
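Put together, the two shutdown fields from the example above look like this in a KubeletConfiguration file:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# 30s total: the first 20s for normal pods, the final 10s for critical pods.
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s
```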
SerializedNodeConfigSource
SerializedNodeConfigSource allows us to serialize v1.NodeConfigSource.
This type is used internally by the Kubelet for tracking checkpointed dynamic configs.
It exists in the kubeletconfig API group because it is classified as a versioned input to the Kubelet.
enabled allows anonymous requests to the kubelet server.
Requests that are not rejected by another authentication method are treated as anonymous requests.
Anonymous requests have a username of system:anonymous, and a group name of system:unauthenticated.
mode is the authorization mode to apply to requests to the kubelet server.
Valid values are AlwaysAllow and Webhook.
Webhook mode uses the SubjectAccessReview API to determine authorization.
clientCAFile is the path to a PEM-encoded certificate bundle. If set, any request presenting a client certificate
signed by one of the authorities in the bundle is authenticated with a username corresponding to the CommonName,
and groups corresponding to the Organization in the client certificate.
LoggingConfiguration contains logging options
Refer to Logs Options for more information.
Field
Description
format[Required] string
Format Flag specifies the structure of log messages.
The default value of format is `text`.
sanitization[Required] bool
[Experimental] When enabled, prevents logging of fields tagged as sensitive (passwords, keys, tokens).
Runtime log sanitization may introduce significant computation overhead and therefore should not be enabled in production.
Scope the issue to a single unit of work. If the problem covers a large
area, break it into multiple smaller issues. For example, "Fix the security docs"
is too broad an issue, while "Add details to the 'Restricting network access' topic"
is specific and actionable.
For tasks that many SIG Docs members work on together and that take a long time
to complete, such as content restructuring, use the feature branch created for that task.
If you need help choosing a branch, ask in the #sig-docs Slack channel.
Create a new branch based on the branch you chose in step 1.
The following example assumes the base branch is upstream/master:
git checkout -b <my_new_branch> upstream/master
Begin making your changes with a text editor.
At any time, you can use the git status command to see the list of files you have changed.
Commit your changes
When you are ready to open a pull request (PR), commit your changes.
In your local repository, check which files you need to commit:
git status
The output is similar to:
On branch <my_new_branch>
Your branch is up to date with 'origin/<my_new_branch>'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: content/en/docs/contribute/new-content/contributing-content.md
no changes added to commit (use "git add" and/or "git commit -a")
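The staging and commit step can be sketched end-to-end in a scratch repository; in a real contribution you would run only the `git add` and `git commit` lines inside your website clone, and the file path and commit message here are just examples:

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init -q
# Simulate the modified file reported by `git status` above.
mkdir -p content/en/docs/contribute/new-content
echo "example edit" > content/en/docs/contribute/new-content/contributing-content.md
# Stage the modified file, then commit it with a descriptive message.
git add content/en/docs/contribute/new-content/contributing-content.md
git -c user.email=you@example.com -c user.name="Your Name" \
    commit -q -m "Add details to the contributing-content topic"
git log --oneline
```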
Squashing commits is a form of rebasing.
The -i switch tells git that you want to rebase interactively.
HEAD~<number_of_commits_in_branch> indicates how many commits to look at for the rebase.
The output is similar to:
pick d875112ca Original commit
pick 4fa167b80 Address feedback 1
pick 7d54e15ee Address feedback 2

# Rebase 3d18sf680..7d54e15ee onto 3d183f680 (3 commands)
...
# These lines can be re-ordered; they are executed from top to bottom.
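The squash flow above can be exercised non-interactively in a throwaway repository: GIT_SEQUENCE_EDITOR rewrites the todo list that `git rebase -i` would normally open (turning every `pick` after the first into `squash`), and GIT_EDITOR=true accepts the combined commit message. The commit messages mirror the example output above; everything else is scaffolding for the demo:

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
mkcommit() {
  # Helper: append to a tracked file and commit with the given message.
  echo "$1" >> notes.txt
  git add notes.txt
  git -c user.email=you@example.com -c user.name="Your Name" commit -q -m "$1"
}
mkcommit "Initial commit"
mkcommit "Original commit"
mkcommit "Address feedback 1"
mkcommit "Address feedback 2"
# Squash the last three commits ("Original commit" plus the two feedback
# commits) into a single commit, without opening an editor.
GIT_SEQUENCE_EDITOR='sed -i -e "2,\$ s/^pick/squash/"' GIT_EDITOR=true \
    git -c user.email=you@example.com -c user.name="Your Name" rebase -i HEAD~3
git log --oneline
```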
This issue sounds more like a request for support and less
like an issue specifically for docs. I encourage you to bring
your question to the `#kubernetes-users` channel in
[Kubernetes slack](https://slack.k8s.io/). You can also search
resources like
[Stack Overflow](https://stackoverflow.com/questions/tagged/kubernetes)
for answers to similar questions.
You can also open issues for Kubernetes functionality in
https://github.com/kubernetes/kubernetes.
If this is a documentation issue, please re-open this issue.
An example response to an issue that is actually a code bug:
This sounds more like an issue with the code than an issue with
the documentation. Please open an issue at
https://github.com/kubernetes/kubernetes/issues.
If this is a documentation issue, please re-open this issue.
# See the OWNERS docs at https://go.k8s.io/owners
# This is the localization project for Spanish.
# Teams and members are visible at https://github.com/orgs/kubernetes/teams.
reviewers:
- sig-docs-es-reviews
approvers:
- sig-docs-es-owners
labels:
- language/es
PRs against non-primary branches: If the PR is against a dev- branch, it's for an upcoming release. Assign the docs release manager using: /assign @<manager's_github-username>. If the PR is against an old branch, help the author figure out whether it's targeted against the best branch.
A task page shows how to accomplish a specific task. The idea is to give readers a sequence of steps that they can actually perform as they read. A task page can be short or long, provided it stays focused on one area. In a task page, it is fine to blend brief explanations with the steps to be performed; if you need to provide a lengthy explanation, do so in a concept topic instead. Related task and concept topics should link to each other. For an example of a short task page, see Configure a Pod to Use a Volume for Storage. For an example of a longer task page, see Configure Liveness and Readiness Probes.
1. Prepare something
1. Followed by a long step with code snippets and notes ...
this is a really long item
1. Another long item ...
.. continues here
1. Almost done ...
Localized version:
<!--
1. Prepare something
-->
1. Prepare something, ...
   Each line here is indented 3 spaces.
<!--
1. Followed by a long step with code snippets and notes ...
this is a really long item
-->
2. This is the second numbered item; the number must be written explicitly and cannot simply carry over from the English numbering.
   Indent the continuation as above, by 3 spaces.
   Even fenced code blocks (triple backticks) and shortcodes are indented by 3 spaces.
<!--
1. Another long item ...
.. continues here
1. Almost done ...
-->
3. Continue the list.
   If an item contains multiple paragraphs, keep the
   indentation aligned so that the layout renders correctly.
4. The list finally ends.
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
Inline text styles
Bold text
Italic text
Bold italic text
Strikethrough
Underline
Underlined italic
Underlined bold italic
monospace text <- monospace font
monospace bold <- bold monospace font
Lists
Markdown has no strict rules about how lists are processed. When we migrated
from Jekyll to Hugo, we ran into some problems. To address them, note the following: